Mastering Statistics: Fundamentals of Data Analysis.
  • Home
  • Math background
  • Probability
  • Exploratory Analysis
  • Supervised Learning
    • Statistical Tests
    • Linear Models
    • Assessing Model Accuracy
    • Classification problems
    • Titanic exercise
    • Polynomial Regression
    • Survival Analysis
    • Tree Based Methods
    • Support Vector Machines
    • Deep Learning with TensorFlow
    • Installing TensorFlow
    • Deep Learning
    • Deep Learning Lab with Torch
  • Unsupervised Learning
  • Matrices
    • Matrix Algebra
    • Matrices in Logistic Regression in Neural Networks

Contents

  • 1 Measuring the Quality of Fit. Mean Squared Error (MSE)
  • 2 Bias-Variance Trade-off
  • 3 R-squared
  • 4 Residual Standard Error (RSE)
  • 5 Multiple Linear regression.
  • 6 Testing error and training error
  • 7 Model fit
  • 8 Collinearity
  • 9 F-statistic.
  • 10 Deciding on the important variables.
    • 10.1 Best Subset selection
    • 10.2 Stepwise Selection
      • 10.2.1 Forward Stepwise selection
      • 10.2.2 Backward stepwise selection
  • 11 Estimating the test error
    • 11.1 Mallow’s Cp
    • 11.2 AIC
    • 11.3 Bayesian Information Criterion (BIC)
    • 11.4 Adjusted RSquared
  • 12 Resampling Methods
    • 12.1 Monte Carlo Method
    • 12.2 Cross-validation
      • 12.2.1 K-fold Cross-validation
      • 12.2.2 Potential problems with cross-validations.
    • 12.3 The bootstrap principle
    • 12.4 Use of Monte Carlo Simulation and Bootstrap
  • 13 Shrinkage Methods
  • 14 Dimension Reduction Methods
  • 15 Multiple hypothesis testing
    • 15.1 Family-Wise Error Rate (FWER)
    • 15.2 Bonferroni Method for Controlling FWER
    • 15.3 Holm’s Method for controlling FWER
    • 15.4 Benjamini-Hochberg Procedure to control
    • 15.5 Resampling approaches

Assessing Model Accuracy

  • Show All Code
  • Hide All Code

  • View Source

In this section, we discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set.

1 Measuring the Quality of Fit. Mean Squared Error (MSE)

In order to evaluate the performance of a statistical method on a given dataset, we need some way to measure how well its predictions actually match the observed data. In regression, the most commonly-used measure is the mean squared error (MSE)

\[ MSE = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{f}(x_i))^2 =\frac{1}{n}\sum^n_{i=1}(y_i-\hat{y_i})^2 \tag{1}\]

that is the squared difference between the observed values and the predicted values divided by the number of data points.

If we remember the formula for the RSS we see that MSE formula can be written as:

\[ MSE = \frac{1}{n} RSS \]

The MSE is computed using the training data that was used to fit the model, and should be referred as training MSE, but we are more interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Roughly speaking, the problem is that many statistical methods specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be small, but the test MSE is often much larger.

the graph above shows a linear regression line, and two polynomial lines one of them over fitted. If we calculate the Mean Square Error of each regression we can see that the over-fit model is not the best predictor over new data:

Code
# Function to calculate MSE
calculate_mse <- function(model, new_data) {
  mean((new_data$Y - predict(model, new_data))^2)
}

# Generate new training data
set.seed(456)
X <- seq(0, 10, length.out = 30)
Y <- 2 + 5 * X + X^2 + rnorm(30, sd = 5)
new_data <- data.frame(X, Y)

# Calculate MSE for each model
mse_linear <- calculate_mse(linear_model, new_data)
mse_parametric <- calculate_mse(parametric_model, new_data)
mse_overfitted <- calculate_mse(overfitted_model, new_data)

# Create a dataframe for MSE values
mse_data <- data.frame(
  Model = c("Linear", "Polynomial", "Overfitted"),
  MSE = c(mse_linear, mse_parametric, mse_overfitted)
)

# Plot MSE values
ggplot(mse_data, aes(x = Model, y = MSE, fill = Model)) +
  geom_bar(stat = "identity") +
  labs(title = "MSE for Different Models on New Training Data", x = "Model", y = "Mean Squared Error") +
  theme_minimal()

2 Bias-Variance Trade-off

Imagine you’re trying to build a model to predict house prices. You could use a simple model or a complex one. The bias-variance trade-off is all about balancing simplicity and complexity.

Bias is error introduced by assuming a model is too simple, capturing only a rough approximation of the real relationship. High bias means the model might miss important patterns (underfitting). Think of a straight line when a curve is more appropriate.

Variance is error introduced by the model being too complex, capturing noise instead of the actual pattern. High variance means the model is too sensitive to small fluctuations in the training data, leading to poor performance on new data (overfitting). Imagine fitting every little bump in the data.

The trade-off part comes in because as you try to decrease bias by making the model more complex, you often increase variance, and vice versa. The goal is to find the sweet spot where the model captures the essential patterns without overreacting to noise.

Low Bias, High Variance: A very flexible model that might fit the training data really well but fails on new data.

High Bias, Low Variance: A very simple model that doesn’t capture the data well, missing patterns.

Optimal Bias-Variance Trade-Off: A balanced model that captures the underlying trend without overfitting.

3 R-squared

We already talked about how to assess the slope of an individual predictor

The Residual Standard Error (RSE) provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of Y , it is not always clear what constitutes a good RSE. The \(R^2\) statistic provides an alternative measure of fit. It takes the form of a proportion —the proportion of variance explained— and so it always takes on a value between 0 and 1, and is independent of the scale of Y .

We already saw the residual sum of squared (RSS) RSS is a measure of the discrepancy between the actual data and the model’s predictions. It represents the sum of the squared differences between the observed values and the predicted values.

\[ RSS=\sum^n_{i=1}(y_1-\hat{y_1})^2 \] RSS measures the amount of variability that is left unexplained after performing the regression.

The total Sum of Squares (TSS): TSS is a measure of the total variability in the observed data. It represents the sum of the squared differences between the observed values and the mean of the observed values. TSS measures the total variance in the response Y and can be thourhg of as the amount of variability inherent in the response before the regression is performed.

\[ TSS=\sum^n_{i=1}(y_1-\bar{y_1})^2 \tag{2}\]

TSS - RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and \(R^2\) measures the proportion of variability in Y that can be explained using X.

\[ R^2 = \frac{TSS-RSS}{TSS}= 1-\frac{RSS}{TSS} \tag{3}\]

The R-squared value measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It’s calculated as:

A larger R-squared value indicates a better fit of the model to the data. An R-squared value closer to 1 means that the model explains a large proportion of the variance in the dependent variable, signifying a good fit. Conversely, a smaller R-squared value indicates that the model explains less of the variance, signifying a poorer fit, or the error variance is high, or both.

However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic.

The \(R^2\) statistic is a measure of the linear relationship between X and Y, and if we recall correlation, it also measures the linear relationship between X and Y, this suggest that we might be able to use correlation \(r-Cor(X,Y)\) instead of \(R^2\) in order to assess the fit of the model, and in fact, it can be shown that in the simple linear regression setting, \(R^2 = r^2\).

4 Residual Standard Error (RSE)

Another measure of the model fit is the Residual Standard Error (RSE) The RSE is an estimate of the standard deviation of the error term, representing the average amount that the observed values deviate from the model’s predictions. Essentially, it gives you an idea of the typical size of the residuals (errors) made by the model. It’s like a measure of how well the model fits the data.

\[ RSE = \sqrt{\frac{1}{n-p-1}RSS} \tag{4}\]

where \(n\) is the number of observations and \(p\) is the number of predictors or independent variables in the model. \(n-p-1\) is the degrees of freedom. A smaller RSE indicates a better fit, implying that the model’s predictions are close to the actual data. - RSS helps calculate RSE, giving you the scale of the errors. - TSS is used to calculate \(R^2\), which, together with RSE, helps you understand the overall model fit.

Both RSE and \(R^2\) provide insight into the model’s performance, but from slightly different angles. While \(R^2\) tells you how much variance in the dependent variable is explained by the model, RSE tells you about the typical size of the prediction errors.

By using these metrics together, you can get a comprehensive view of how well your regression model is performing

5 Multiple Linear regression.

When we have more than one predictor, the model will create a plane instead of a curve that tries to minimize the sum of squared errors:

Remember the formula for RSS for a simple linear model was explained already now we have a similar version including all the predictors:

\[ RSS=\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta_0} - \hat{\beta_1} x_{i1} - \hat{\beta_2} x_{i2}\dots... - - \hat{\beta_p} x_{ip}\right)^2 \tag{5}\]

This is done using statistical software. The values \(\hat{\beta_o} ,\hat{\beta_1},\dots ,\hat{\beta_p}\) that minimize RSS are the multiple least square regression coefficient estimates

The ideal scenario is when the predictors are uncorrelated: each coefficient can be estimated and tested separately, a unit change in \(X_j\) is associated with a \(\beta_j\) change in \(Y\) while all the other variables stay fixed.

Correlations amongst predictors cause problems: the variance of all coefficients tends to increase, sometimes dramatically and interpretations become hazardous, when \(X_j\) changes, everything else changes. We cannot attribute the change to a specific predictor. For example height and weight are two predictors that are correlated.

When looking at multiple regression models we have to ask ourselves some questions: - Is at least one of the predictors useful in predicting the response - Do all the predictors help to explain Y or only some of them are useful? - How well does the model fit the data?

6 Testing error and training error

The training error rate is the proportion of mistakes that are made if we apply our model to the training data-set.

The test error is the average error that results from using a statistical learning method to predict the response on a new observation, that is, a measurement that was not used in training the method.

The test error can be easily calculated if a designated test set is available. Unfortunately, this is usually not the case.In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training. But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

7 Model fit

Two of the most common numerical measures of model fit are the RSE and \(R^2\), the fraction of variance explained. These quantities are computed and interpreted in the same fashion as for simple linear regression. Recall that in simple regression, \(R^2\) is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals \(Cor(Y, \hat{Y} )^2\), the square of the correlation between the response and the fitted linear model. It turns out that \(R^2\) will always increase when more variables are introduced, even if those variables are only weekly associated with the response. If we notice that the increase in \(R^2\) is minimal after adding another variable, it is probably best to leave that variable out of the model. We can check RSE in combination with \(R^2\), as RSE will increase despite RSS increasing as well if the new variable is not adding enough value. Remember the formula for RSE (Equation 4):

\[ RSE = \sqrt{\frac{1}{n-p-1}RSS} \]

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

Example:

We run a campaign to sell a product, and we advertise in three mediums, newspapers, radio and tv. We have the amount spent in each of the mediums and the sales. We will first run a simple linear model over the three variables:

Code
Advertising <- readr::read_csv("data/Advertising.csv")
tv<- lm(sales ~ TV, data = Advertising)
summary(tv)

Call:
lm(formula = sales ~ TV, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36 <0.0000000000000002 ***
TV          0.047537   0.002691   17.67 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 0.00000000000000022
Code
radio<- lm(sales ~ radio, data = Advertising)
summary(radio)

Call:
lm(formula = sales ~ radio, data = Advertising)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7305  -2.1324   0.7707   2.7775   8.1810 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542 <0.0000000000000002 ***
radio        0.20250    0.02041   9.921 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332, Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 0.00000000000000022
Code
newspaper<- lm(sales ~ newspaper, data = Advertising)
summary(newspaper)

Call:
lm(formula = sales ~ newspaper, data = Advertising)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2272  -3.3873  -0.8392   3.5059  12.7751 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88 < 0.0000000000000002 ***
newspaper    0.05469    0.01658    3.30              0.00115 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,   Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

We see that independently, all three variables have a \(p\)-value below or significance level.

If we fit the model to the three variables instead newspaper stops showing as significant.

Code
lm.fit<- lm(sales ~ TV+radio+newspaper, data = Advertising)
summary(lm.fit)

Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422 <0.0000000000000002 ***
TV           0.045765   0.001395  32.809 <0.0000000000000002 ***
radio        0.188530   0.008611  21.893 <0.0000000000000002 ***
newspaper   -0.001037   0.005871  -0.177                0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 0.00000000000000022

The Adjusted Rsquared is 0.8956 and the RSE 1.686

If we fit the model with only the two significant variables:

Code
lm.fit<- lm(sales ~ TV+radio, data = Advertising)
summary(lm.fit)

Call:
lm(formula = sales ~ TV + radio, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7977 -0.8752  0.2422  1.1708  2.8328 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  2.92110    0.29449   9.919 <0.0000000000000002 ***
TV           0.04575    0.00139  32.909 <0.0000000000000002 ***
radio        0.18799    0.00804  23.382 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8962 
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 0.00000000000000022

The adjusted R-squared is 0.8962 instead of 0.8956 but the RSE has decreased from 1.686 in the model with three predictors to 1.681 in the simpler model. So, the model sales ~ TV + radio is more parsimonious and just as effective. Plus, it’s always good to avoid unnecessary complexity when you can.

Sometimes visualizing the data is helpful:

Code
# Calculate predicted values
Advertising$predicted_sales <- predict(lm.fit)

# Create 3D scatter plot
s3d <- scatterplot3d(Advertising$TV, Advertising$radio, Advertising$sales, 
                     pch = 16, highlight.3d = TRUE, type = "p", 
                     main = "3D Scatter Plot with Regression Plane",
                     xlab = "TV", ylab = "Radio", zlab = "Sales")

# Add regression plane
s3d$plane3d(lm.fit, col = "lightblue")

# Plot lines from data points to the regression plane
for (i in 1:nrow(Advertising)) {
  s3d$points3d(x = c(Advertising$TV[i], Advertising$TV[i]), 
               y = c(Advertising$radio[i], Advertising$radio[i]), 
               z = c(Advertising$sales[i], Advertising$predicted_sales[i]), 
               type = "l", col = "green")
}

8 Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Since collinearity reduces the accuracy of estimates of the regression coefficients, it causes the standard error to grow. Recall that the t-statistic for each predictor is calculated by dividing \(\hat{\beta_j}\) by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject the null hypothesis, this means that the probability of correctly detecting a non-zero coefficient is reduced. A simple way to detect collinearity between two variables is to look at the correlation matrix of the predictors, but it is also possible that there is collinearity between three variables, even if no pair of variables have a high correlation, and this will not be detected in the correlation matrix. For this we can calculate the variance inflaction factor (VIF) but this is out of the scope here.

9 F-statistic.

\[ F= \frac{TSS-RSS/p}{RSS/(n-p-1)} \tag{6}\]

The F-statistic in a multiple regression model assesses the overall significance of the model. It tells you whether your model’s variables are jointly significant in explaining the variation in the dependent variable. High F-statistic: Indicates that the group of predictors has a significant relationship with the dependent variable. The larger the F, the more likely the observed relationship is not due to random chance. Low F-statistic: Suggests that the group of predictors does not have a significant explanatory power.

Whether an F value is high or low is determined by comparing it to a critical value from the F-distribution table, which depends on the degrees of freedom and the chosen significance level (commonly 0.05). Steps to Determine High or Low F-value: - Degrees of Freedom: Calculate the degrees of freedom for the numerator (df1 = number of predictors) and the denominator (df2 = total number of observations - number of predictors - 1).

F-distribution Table: Use the degrees of freedom to find the critical F-value from the F-distribution table at your chosen significance level (usually 0.05).

Compare: If the calculated F-value from your regression output is greater than the critical F-value from the table, it means your predictors are collectively significant (high F-value).

Example: Suppose you have 3 predictors in your model and a sample size of 100. The degrees of freedom are:

df1 (numerator) = 3

df2 (denominator) = 100 - 3 - 1 = 96

Looking up the critical value for df1 = 3 and df2 = 96 at the 0.05 significance level in an F-table, you might find it to be around 2.70.

If your calculated F-value is, say, 5.2, since 5.2 > 2.70, your model is significant (high F-value).

If it were 1.8, since 1.8 < 2.70, your model isn’t significant (low F-value).

In r, the f-value will also come with its \(p\)-value, indicating its significance.

10 Deciding on the important variables.

With today’s big data there is more availability of data than ever, and it is becoming usual to have many many features to fit a model, and model accuracy can be challenging in those situations, specially when we have more features than number of samples (p>n). By removing irrelevant features we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.

  • Subset selection: We identify a subset of the \(p\) predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

  • Shrinkage: We fit a model involving all \(p\) predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.

  • Dimension reduction: We project the \(p\) predictors into a M-dimensional subspace, where M<p. This is achieved by computing M different linear combinations or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

10.1 Best Subset selection

The most direct approach to answer this question is called all subsets or best subsets regression. We compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size. This gets really hard when the number of variables is high, the number of possible models is \(2^p\) so for example for p=40 there are over a billion models. Instead, we need an automated approach that searches through a subset of them. We discuss two commonly use approaches next.

Although we are presenting this here in the context of linear models, the same ideas apply to other types of models, such as logistic regression. One difference will be that other models will work with a similar metric to the residual sum of squares that is called deviance.

A limitation of the subset selection method is the number of p. Most statistical packages have a limit of 30 to 40 features. Another problem of this method with high number of \(p\) is overfitting, so this method is only valid when \(p\) is small (<10).

We start with the null model that is a model that contains the intercept but no predictors. The intercept is the mean of \(y\) Now you start adding variables one at a time: for each variable you fit \(p\) simple linear models each with one of the variables and the intercept, and you look at each of them. After that you choose the variable that results in the lowest RSS (that is the largest \(R^2\)). Now, having picked that, you fix that variable in the model and now you add one by one all the variables to this model again, and choose the one that best improve the residual sum of squares.

You can continue like this until some stopping rule is satisfied, for example, when all the remaining variables have a \(p\) value above some threshold.

Once we have a correct combination of predictors, to select if the best model is the model with 2,3,or n predictors, we cannot use RSS anymore because naturally RSS will decrease as we include more predictors, but that does not necessarily means it’s a best model. We need to select the model that will have the smallest test error, and not the one with the smallest training error. Remember we talked about these concepts already. Frequently for this we use cross-validation

Example

Here we apply the best subset selection approach to the Hitters data from ISLR2 library. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. Note that the Salary variable is missing in some rows, so we are going to omit them

Code
#nubmer of nas. 
sum(is.na(Hitters$Salary))
[1] 59
Code
#remove nas
Hitters<- na.omit(Hitters)
sum(is.na(Hitters$Salary))
[1] 0

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for ‘lm()’. The ‘summary()’ command outputs the best set of variables for each model size.

Code
regfit.full <- leaps::regsubsets(Salary ~ ., Hitters)
(reg.summary<- summary(regfit.full))
Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
         CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "       

For each subset size (number of features 1…n) it returns the best subset models. An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI.

By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired.

Code
regfit.full <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
(reg.summary <- summary(regfit.full))
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: exhaustive
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       

We can see the R squared using reg.summary$rsq. Here we can see that the R2 statistic increases from 32%, when only one variable is included in the model, to almost 55%, when all variables are included. As expected, the R2 statistic increases monotonically as more variables are included, so this is not a good statistic for choosing the right size for the model because it will always recommend the model with more variables.

Code
reg.summary$rsq
 [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
 [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159

So now that we have the best combination of variables for each model size, we can use cross validation or other methods to select the model with the best number of predictors.

We will see study Mallow’s Cp, AIC or BIC soon, and regsubset gives us the statistics for the different models for those methods:

Code
par(mfrow = c(2, 2))
plot(reg.summary$bic, xlab = "Number of Variables",
    ylab = "BIC", type = "l")
plot(reg.summary$cp, xlab = "Number of Variables",
    ylab = "Cp", type = "l")
plot(reg.summary$rsq, xlab = "Number of Variables",
    ylab = "RSS", type = "l")
plot(reg.summary$rss, xlab = "Number of Variables",
    ylab = "rss", type = "l")

with this information we will be able to select the model with the correct number of variables.

10.2 Stepwise Selection

10.2.1 Forward Stepwise selection

The stepwise selection is very similar to the best subset as we start with a model with only the intercept and start aggregating predictors and selecting those with less RSS. The difference with best subset selection is that these models are nested and brings the number of models that we are considering from \(2^p\) to \(p^2\) which is more manageable. Because we are not considering every possible combination of all the predictors, this method is not guaranteed to find the model with less RSS.

Again, once we have the best version of the model with each number of predictors, we will choose among them using cross-validation or a similar technique.

10.2.2 Backward stepwise selection

It runs a similar process but starting with a model with all the variables, then you run the model removing each variable and see which one is the least statistically significant one. The new (p-1) variable model is fit, and the variable with the largest \(p\)-value is removed. Continue until a stopping rule is reached, for instance, we may stop when all remaining variables have a significant \(p\)-value defined by some significance threshold.

One limitation of the backward stepwise selection is that we can only do it with n>p (more samples than predictors). This limitation does not apply to forward stepwise selection.

Example

We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method = "forward" or method="backward". We are going to format a bit the output of the summary so it is more friendly to the reader:

Code
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19, method = "forward")

regfit.bwd <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19, method = "backward")

createSummaryTable <- function(regfitObject){

summary <- summary(regfitObject)
matrix_of_models <- as.data.frame(summary$which)
matrix_of_models <- matrix_of_models %>%
  mutate_all(~ifelse(., "**", ""))

kable(matrix_of_models, escape = FALSE) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE) 
}

createSummaryTable(regfit.fwd)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
** **
** ** **
** ** ** **
** ** ** ** **
** ** ** ** ** **
** ** ** ** ** ** **
** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
Code
createSummaryTable(regfit.bwd)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
** **
** ** **
** ** ** **
** ** ** ** **
** ** ** ** ** **
** ** ** ** ** ** **
** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **

For instance, we see that using forwards stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best model for our number of variabless identified by forward and backward stepwise selection and best subset selection are different.

Code
coef(regfit.full, 7)
 (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun 
  79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538 
   DivisionW      PutOuts 
-129.9866432    0.2366813 
Code
coef(regfit.fwd, 7)
 (Intercept)        AtBat         Hits        Walks         CRBI       CWalks 
 109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070 
   DivisionW      PutOuts 
-127.1223928    0.2533404 
Code
coef(regfit.bwd, 7)
 (Intercept)        AtBat         Hits        Walks        CRuns       CWalks 
 105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346 
   DivisionW      PutOuts 
-116.1692169    0.3028847 

As we saw before, the summary() function also returns \(R^2\), RSS, adjusted \(R^2\), Cp, and BIC. We can examine these to try to select the best overall model.

Code
reg.summary <- summary(regfit.fwd)
names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   

Plotting RSS, adjusted R Squared, Cp and BIC for all the models at once will help us decide which model to select. Note the type ="1" option tells R to connect the plotted points with lines. Here we are plotting RSS and Adjusted R Squared

Code
par(mfrow = c(1, 2))
plot(reg.summary$rss, xlab = "Number of Variables",
    ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")

The points() command works like the plot() command, except that it puts points on a plot that has already been created, instead of creating a new plot. The which.max() function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted R2 statistic, which would be the preferred one.

Code
max<- which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")
points(max, reg.summary$adjr2[max], col = "red", cex = 2, 
    pch = 20)

Similarly we can plot the Cp and the BIC statistics and indicate the models with the smallest statistic:

Code
min<- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab = "Number of Variables",
    ylab = "Cp", type = "l")
points(min,reg.summary$cp[min], col = "red", cex = 2, 
    pch = 20)

Code
min<- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = "Number of Variables",
    ylab = "BIC", type = "l")
points(min, reg.summary$bic[min], col = "red", cex = 2,
    pch = 20)

The ‘regsubsets()’ function has a built-in ‘plot()’ command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, Cp, adjusted R2, or AIC. To find out more about this function, type ?plot.regsubsets. We will see the BIC here:

Code
plot(regfit.fwd, scale = "bic")

The top row of the plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to −150. However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function to see the coefficient estimates associated with this model.

Code
coef(regfit.fwd, 6)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
     PutOuts 
   0.2643076 

As we did before we can plot the results for Cp and and see the difference between our bakwards and forward step wise selections:

Code
plot(regfit.fwd, scale="Cp")

Code
plot(regfit.bwd, scale="Cp")

11 Estimating the test error

  • We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.

  • We can directly estimate the test error, using validation set approach or a cross-validation approach (see chapter Resampling Methods). These involve fitting the model in part of the dataset and testing in with unseen data.

We are going to see here other methods:

11.1 Mallow’s Cp

This technique adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables. \[ C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2) \tag{7}\]

Where d is the total number of parameters used and - \(\hat{\sigma}^2\) is an estimate of the variance of the error term \(\epsilon\) associated with each response measurement, calculated as: \[ \hat{\sigma}^2 = \frac{SSE}{n-p} \tag{8}\] SSE is the Sum of Squared Errors or Residual Sum of Squares

\[ SSE = \sum^n_{i=1}(y_i-\hat{y_i})^2 \tag{9}\]

This technique is restricted to those models where p<n. Example using the mtcars dataset

Code
# Fit the model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Extract the residuals
residuals <- model$residuals

# Calculate SSE
SSE <- sum(residuals^2)

# Number of observations
n <- length(mtcars$mpg)

# Number of predictors (including intercept)
p <- length(coefficients(model))

# Calculate \(\hat{\sigma}^2\)
sigma_hat_sq <- SSE / (n - p)

# Print the result
print(sigma_hat_sq)
[1] 6.725785

Now we calculate Mallow’s Cp for the same dataset for a model with with two parameters:

Code
# Fit the subset model
subset_model <- lm(mpg ~ hp + wt, data = mtcars)

# Calculate residuals and SSE for the subset model
residuals_subset <- subset_model$residuals
SSE_subset <- sum(residuals_subset^2)

# Number of predictors in the subset model (including intercept)
p_subset <- length(coefficients(subset_model))
print(SSE_subset)
[1] 195.0478
Code
# Calculate Mallows' Cp 
Cp <- (SSE_subset / sigma_hat_sq) + 2 * p_subset - n 
print(Cp)
[1] 3

When comparing two values for different models we want to choose the smallest Cp.

11.2 AIC

The criterion is defined for a large class of models fit by maximum likelihood. We are not going to look at the formula right now, but it is similar to the previous and it is actually proportional to Cp, so again, we will choose the model with the smallest value.

11.3 Bayesian Information Criterion (BIC)

\[ BIC = \frac{1}{n}(RSS+log(n)d\hat{\sigma}^2) \tag{10}\]

We also look for the smallest value of the BIC for our models. The BIC statistic generally places a heavier penalty on models with many variables than the two methods seen before.

Because all these three methods rely on calculating hat sigma squared, all of them require that n>p due to the formula to calculate it (Equation 8)

11.4 Adjusted RSquared

For a least squared model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as:

\[ \text{Adjusted } R^2= 1- \frac{RSS/(n-d-1)}{TSS/(n-1)} \tag{11}\]

where TSS is the total sum of squares.

Unlike Cp, AIC and BIC, for which a small value indicates a model with low test error, a large value of adjusted R squared indicates a model with small test error. Maximizing the adjusted \(R^2\) is equivalent to minimizing RSS/(n-d-1). While RSS always decreases as the number of variables in the model increases, the presence of d in the denominator corrects for that, so the inclusion of unnecessary variables in the model pays a price.

The drawback of Adjusted Rsquared is that it cannot be used for other models, like logistic, for example.

Example: Validation Set Approach:

We saw how to choose our best models for the Hitters dataset using Best subset selection, Stepwise Selection and then using methods like BIC, AIC to select the best size for our model.

Now we will consider how to do this using the validation set and cross-validation approaches. In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of the model-fitting, including variable selection. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error. In order to use the validation approach, we begin by splitting the observations into a training set and a test set. We choose the size of the train set as 180 (approx 2/3 of the data)

Code
objects_to_remove <- ls(pattern = "^regfit") 
rm(list = objects_to_remove)
n<-nrow(Hitters)
set.seed(1)
train <- sample(seq(1:n), 180, replace= FALSE)

now we apply regsubsets() to the training set in order to perform best subset selection.

Code
regfit.fwd<- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax= 19, method="forward")
createSummaryTable(regfit.fwd)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
** **
** ** **
** ** ** **
** ** ** ** **
** ** ** ** ** **
** ** ** ** ** ** **
** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **

We now make a model matrix from the test data:

Code
test.mat <- model.matrix(Salary ~ ., data = Hitters[-train, ])
head(test.mat)
                  (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
-Andre Dawson               1   496  141    20   65  78    37    11   5628
-Andres Galarraga           1   321   87    10   39  42    30     2    396
-Al Newman                  1   185   37     1   23   8    21     2    214
-Argenis Salazar            1   298   73     0   24  24     7     3    509
-Andres Thomas              1   323   81     6   26  32     8     2    341
-Alex Trevino               1   202   53     4   31  26    27     9   1876
                  CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
-Andre Dawson      1575    225   828  838    354       1         0     200
-Andres Galarraga   101     12    48   46     33       1         0     805
-Al Newman           42      1    30    9     24       1         0      76
-Argenis Salazar    108      0    41   37     12       0         1     121
-Andres Thomas       86      6    32   34      8       1         1     143
-Alex Trevino       467     15   192  186    161       1         1     304
                  Assists Errors NewLeagueN
-Andre Dawson          11      3          1
-Andres Galarraga      40      4          1
-Al Newman            127      7          0
-Argenis Salazar      283      9          0
-Andres Thomas        290     19          1
-Alex Trevino          45     11          1

The model.matrix() function is used in many regression packages for building an X matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.fwd for the best model that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test Mean Square for Error (MSE).

First we will do a sample iteration showing some of the the values so we can see what the loop is doing we choose the model with three variables + the intercept:

Code
(coefi <- coef(regfit.fwd, id = 3))
 (Intercept)        Walks         CRBI    DivisionW 
 116.3421387    6.7810155    0.7516748 -177.8700862 
Code
 print(names(coefi))
[1] "(Intercept)" "Walks"       "CRBI"        "DivisionW"  
Code
 head(test.mat[, names(coefi)])
                  (Intercept) Walks CRBI DivisionW
-Andre Dawson               1    37  838         0
-Andres Galarraga           1    30   46         0
-Al Newman                  1    21    9         0
-Argenis Salazar            1     7   37         1
-Andres Thomas              1     8   34         1
-Alex Trevino               1    27  186         1
Code
 pred <- test.mat[, names(coefi)] %*% coefi
 head(pred)
                       [,1]
-Andre Dawson     997.14318
-Andres Galarraga 354.34964
-Al Newman        265.50854
-Argenis Salazar   13.75113
-Andres Thomas     18.27712
-Alex Trevino     261.37098
Code
 salaryTestData<- Hitters$Salary[-train]
 meanSquareError<- mean((salaryTestData - pred)^2)

Let’s run the loop now:

Code
val.errors <- rep(NA, 19)
for (i in 1:19){
  coefi <- coef(regfit.fwd, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[-train] - pred)^2)
}

we can now plot the square root of MSE to see the results. Taking the square root of MSE gives us the Root Mean Squared Error (RMSE), which is on the same scale as the original data. This makes it easier to interpret and compare. The model summary does not give us MSE directly, but we can calculate it: \(MSE=RSS/n\). In our code: regfit.fwd$rss[-1]/length(Hitters$Salary[test]) the minus one is to remove the intercept.

Code
plot(sqrt(val.errors), ylab= "Sqrt MSE", pch=1,type="b",ylim= c(250,450))
points(which.min(val.errors), sqrt(val.errors[which.min(val.errors)]), col="red", cex = 2,
    pch = 20, )

points(sqrt(regfit.fwd$rss[-1]/length(Hitters$Salary[train])),col="blue", pch =1, type="b")
legend("topright", legend=c("Training","Validation"), col= c("blue","black"), pch=1)

the best model is the one with the min error:

Code
val.errors
 [1] 168475.6 155752.3 155117.6 143942.4 133666.6 123143.9 123647.3 129815.6
 [9] 134533.1 134622.5 136185.3 141050.9 139742.9 139190.8 140488.9 135309.0
[17] 132339.1 134276.7 135389.4
Code
min<-which.min(val.errors)
coef(regfit.fwd, min)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
 100.1066984   -1.2667362    4.9577691    4.5588605    0.7525976 -160.7461336 
     PutOuts 
   0.2695514 

This was a little tedious because there is no predict method for regsubsets(). Since we will be using this function again, we can write our own method:

Code
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) #extract call formula
  mat <- model.matrix(form, newdata) 
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
 }

Our function pretty much mimics what we did above. The only complex part is how we extracted the formula used in the call to regsubsets(). We demonstrate how we use this function below, when we do cross-validation.

Finally, we perform best subset selection on the full data set, and select the best model for our number of variables. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best model for our number of variables, rather than simply using the variables that were obtained from the training set, because the best model for our number of variables on the full data set may differ from the corresponding model on the training set.

In fact, we see that the best model for our number of variables on the full data set has a different set of variables than the best model for our number of variables on the training set.

Code
regfit.best <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
coef(regfit.best, min)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
     PutOuts 
   0.2643076 

12 Resampling Methods

12.1 Monte Carlo Method

The Monte Carlo method is a powerful statistical technique used to understand the impact of risk and uncertainty in prediction and forecasting models.

If for example we want to know the average height of all people living in USA, we estimate the parameter we are interested in (average height of the population) using a statistic estimator that is the average height of a sample of the population drawn at random. \(\theta\) is the parameter we want to calculate and we use \(\hat{\theta}\) that is the average of our sample. By the law of large numbers, the approximation error can be made arbitrarily small by using a large enough sample size.

What if we are interested in an estimator \(\hat{\theta}\) for some parameter \(\theta\) of the population and the normal approximation is not valid for that estimator? In such situations, simulations can often be used to estimate the parameters.

The Monte Carlo method relies on repeated random sampling to obtain numerical results. The core idea is to use randomness to solve problems that might be deterministic in nature. This method is named after the Monte Carlo Casino in Monaco, reflecting the element of chance and randomness.

In our example to estimate the height of the population of USA adults we get many (let’s say 1000) samples of n=100 observations. We then compute the estimator \(\hat{\theta}\) for each sample (in our case the average height), resulting in 1000 estimates. Now we compute the standard deviation of these 1000 estimates and it results to be close to the standard error of my estimator. :

\[ SE(\hat{\theta})=\sqrt{E(\hat{\theta}-E(\hat{\theta}))^ 2} \]

\[ s(\hat{\theta}_1,...,\hat{\theta}_{1000})= \sqrt{\frac{1}{999}\sum^{1000}_{i=1}(\hat{\theta}-mean(\hat{\theta_i}))^ 2} \approx SE(\hat{\theta}) \]

The caveat of this method is that it will only work where we can extract as many samples as we wish.

Usually we cannot have as many samples as we wish, when we want to use the Monte Carlo Method in practice, it is typical to assume a parametric distribution and generate a population from this, which is called a parametric simulation. This means that we take parameters estimated from the real data and using the mean and standard deviation, we plug these into a model of normal distribution and generate new samples.

For example for the case of mice weights, we know that mice typically weight 24 gr with a SD of about 3.5 gr. and that the distribution is approximately normal, to generate population data:

Code
controls<- rnorm(5000, mean=24, sd=3.5) 

12.2 Cross-validation

As the models become more complex (the more features we add to our model), the training error becomes lower. On the other hand, the test error does not reduces at the same rate as the model gets more complex, it starts going down but then it starts increasing again as the model starts overfitting to the training dataset.

The prediction error comes from bias and variance. The bias is how var off on the average the model is from the truth. The variance is how much the estimate varies around its average. The less parameters we fit into our model, the more bias we have (we will find it more difficult to find the true mean) and as we increase the parameters, the bias reduces, but the variance goes up, because we have more and more parameters to estimate from the same amount of data. The solution for this would be a large designated test set, but this is not often available.

There are some methods to try to adjust the training error so it does not over fit and produces an increase in the testing error rate. We randomly divide the available set of samples into two parts: a training set and a validation or hold out set. The model is fit on the training set and the fitted model is used to predict responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error. This is typically assessed using Mean Squared Error (MSE) in the case of quantitative response and misclassification rate in the case of qualitative (discrete) responses.

Code
# Set up the plotting area for two plots side by side
par(mfrow = c(1, 2))

# Plot 1: MSE with All Data
# Initialize a vector to store MSE errors for each polynomial degree
mse_error_all <- rep(0, 10)

# Loop to calculate MSE errors for polynomial degrees 1 to 10
for (i in 1:10) {
  glm_fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  predictions <- predict(glm_fit)
  actuals <- Auto$mpg
  mse_error_all[i] <- mean((predictions - actuals)^2)
}

# Plot the MSE errors for all data
plot(1:10, mse_error_all, 
     type = "b", 
     col = 1, 
     ylim= c(0,(max(mse_error_all) + 1)),
     xlab = "Degree of Polynomial", 
     ylab = "MSE", 
     main = "MSE with All Data")

# Plot 2: MSE with Different Samples
# Initialize a list to store MSE errors for each sample
mse_error_samples <- list()

# Loop to create different samples and calculate MSE errors
for (j in 1:10) {
  # Create a different sample
  train <- sample(c(TRUE, FALSE), nrow(Auto), replace = TRUE, prob = c(0.6, 0.4))
  
  # Initialize a vector to store MSE errors for the current sample
  mse_error_samples[[j]] <- rep(0, 10)
  
  # Loop to calculate MSE errors for polynomial degrees 1 to 10
  for (i in 1:10) {
    glm_fit <- glm(mpg ~ poly(horsepower, i), data = Auto[train, ])
    predictions <- predict(glm_fit, newdata = Auto[!train, ])
    actuals <- Auto$mpg[!train]
    mse_error_samples[[j]][i] <- mean((predictions - actuals)^2)
  }
}

y_max_samples <- max(unlist(mse_error_samples)) + 1

# Plot the MSE errors for different samples
plot(1:10, mse_error_samples[[1]], type = "b", col = 1, xlab = "Degree of Polynomial", ylab = "MSE", main = "MSE with Different Samples",
  ylim = c(0, y_max_samples))
for (j in 2:10) {
  lines(1:10, mse_error_samples[[j]], type = "b", col = j)
}
legend("bottomleft", legend = paste("Sample", 1:10), col = 1:10, lty = 1, cex = 0.8)

two-fold cross validation

In the first graph we can see that a polynomial of level 2 reduces drastically the MSE from a linear model, but more polynomial terms do not have the same effect. The second graph shows us how the MSE vary with different samples, so the choice of the sample will have an effect on the fit of the model. Another drawback of this method is that, by fitting the model to a smaller number of data points, the estimated error will be higher than fitting the model to the whole available data points.

Now we will see some techniques that try to solve some of those drawbacks.

12.2.1 K-fold Cross-validation

K-fold cross validation is a widely used approach for estimating test error. The idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K-1\) parts (combined) and then obtain predictions for the left-out \(kth\) part.This is done in turn for each part k, and then the results are combined. \(K\) is usually between 5 and 10. Let’s the \(K\) parts be \(C_1,C_2,\dots,C_k\). where \(C_k\) denotes the indices of the observations in part k. There are \(n_k\) observations in part k. \[ CV_{(K)}= \sum^K_{k=1}\frac{n_k}{n} MSE_k \tag{12}\]

Where: \(CV_{(K)}\): This represents the cross-validation error estimate for K-fold cross-validation. \(\sum^K_{k=1}\): This indicates that the sum is taken over all \(K\) folds. \(\frac{n_k}{n}\): This is the proportion of the total data used in the k-th fold, where \(n_k\) is the number of observations in the k-th fold and \(n\) is the total number of observations. \(MSE_k\): This is the Mean Squared Error for the k-th fold. \(MSE_k=\sum_{i\epsilon C_k}(y_i-\hat{y_i})^2/n_k\) and \(\hat{y_i}\) is the fit for the observation i, obtained from the data with the part \(K\) removed.

One special use of the k-fold cross validation is setting K=n, this is called leave-one-out cross validation (LOOCV): each observation is left out and used as a validation set, while the whole set of other data points are used for training the model. Leave one out cross validation for linear or polynomial model uses this formula: \[ CV_{(n)}=\frac{1}{n} \sum_{i=1}^n \left(\frac{y_i-\hat{y_i}}{1-h_i} \right)^2 \tag{13}\]

where \(\hat{y_i}\) is the ith fitted value for the original least squares fit and \(h_i\) is the leverage (diagonal of the “hat” matrix). This is like the ordinary MSE except the ith residual is divided by \(1-h_i\) The \(h_i\) tells you how much influence a single point has on the its own fit, it will be a number between 0 and 1.

Cross-validation takes the average of the errors over the n-folds so in LOOCV each sample looks much more like the other ones, because it only leaves one point out. As a result, each error estimate are highly correlated, when you average these highly correlated errors, the resulting average has a high variance -it is more sensitive to slight changes or outliers- because there’s little variation in the training data across folds. It’s not about the individual errors being different; it’s about them moving together because the training sets are too similar. Remember the trade off we talked about between bias and variance in the Bias-variance trade off section

For this reason it is often preferred to use k=5 or 10. Let’s compute the cross-validation in r using the formula we saw above. First we are going to fit the model to a linear model. We can use the function glm and if we don’t pass any family type it will fit to a linear model.

Code
#linear model:
fit.lm<- glm(mpg ~ horsepower, data = Auto)
loocv<- function(fit){
  h <-lm.influence(fit)$h
  mean((residuals(fit)/(1-h))^2)
}

loocv(fit.lm)
[1] 24.23151

But we have a handy function in r that will do this for us: We will use the library(boot) to calculate the cross validation errors for the Auto dataset using the function cv.glm. If we don’t pass any value of k, it will default to leave-one-out cross-validation.

Code
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- boot::cv.glm(Auto, glm.fit)
cv.err$delta
[1] 24.23151 24.23114

And we see that we get the same result that the one we calculated manually above. The first Delta is the cross-validation prediction error, and the second one is a biased corrected version of it, taking into consideration that the dataset is smaller for the dataset in cross validation. This has little effect in leave one out, but it will have a bigger effect in k=n cross-validation.

Now we are going to fit polynomials of different degrees to our data because it looks non-linear, so we expect some of these to fit better than a linear model.

Code
# Leave one out
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- boot::cv.glm(Auto, glm.fit)$delta[1]
}
# Plot the MSE errors for all data
plot(1:10, cv.error, 
     type = "b", 
     col = 1, 
     ylim= c(2,(max(cv.error) + 1)),
     xlab = "Degree of Polynomial", 
     ylab = "MSE", 
     main = "Leave One Out")

LOOCV for different polynomials

Now let’s try 10-k fold cross-validation.

Code
## $k$-Fold Cross-Validation
# Initialize a list to store CV errors for each sample
cv.error.10 <- list()

# Loop to create different samples and calculate MSE errors
for (j in 1:10) {
  # Initialize a vector to store MSE errors for the current sample
  cv.error.10[[j]] <- rep(0, 10)
  
  # Loop to calculate MSE errors for polynomial degrees 1 to 10
  for (i in 1:10) {
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.error.10[[j]][i] <- boot::cv.glm(Auto, glm.fit, K = 10)$delta[1]
  }
}

y_max_samples <- max(unlist(cv.error.10)) + 1

# Plot the MSE errors for different samples
plot(1:10, cv.error.10[[1]], type = "b", col = 1, xlab = "Degree of Polynomial", ylab = "MSE", main = "k=10 fold CV",
  ylim = c(2, y_max_samples))
for (j in 2:10) {
  lines(1:10, cv.error.10[[j]], type = "b", col = j)
}
legend("bottomleft", legend = paste("k", 1:10), col = 1:10, lty = 1, cex = 0.8)

10k-fold cross validation

We can see how the variability in 10-k fold is much less than what we obtained before using half of the data (2-k)

For classification problems the k-fold cross-validation works exactly the same, the only difference is that instead of MSE to measure the cross validation error we use the [misclassification error rate] (#misclassificationErrorRate).

12.2.2 Potential problems with cross-validations.

Consider a simple classifier applied to some two-class data. We have 5000 predictors and 50 samples. With this, we find the 100 predictors having the largest correlation with the class labels. We then apply a classifier such as logistic regression using only those 100 predictors. If we now try to estimate the test set performance of this classifier using cross-validation, we will get a much lower error estimate than real. This is because when we choose the 100 best predictors, this is a form of training for the model, and must be included in the validation process. If for example we simulate data with the class labels independent of the outcome (so our test error is 50%) and then we choose the best predictors and create a model with them, when we calculate the cross-validation, our error estimate will be smaller than it should be.

To avoid this we will have to create our folds and remove our test fold before the process of fitting and choosing the predictors, and then apply the cross validation over the fold we set apart.

You can find more information in the lectureVideo

Example: Model selection with Cross Validation:

We now try to choose among the models of different sizes using cross-validation. We must perform best subset selection within each of the \(k\) training sets. Despite this, we see that with its clever subsetting syntax, R makes this job quite easy.

First, we create a vector that allocates each observation to one of \(k=10\) folds, and we create a matrix in which we will store the results.

Code
k <- 10
n <- nrow(Hitters)

folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19,
    dimnames = list(NULL, paste(1:19)))

Now we write a for loop that performs cross-validation. In the \(j\)th fold, the elements of folds that equal j are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors. Note that in the following code R will automatically use our predict.regsubsets() function when we call predict() because the best.fit object has class regsubsets.

Code
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ .,
       data = Hitters[folds != j, ],
       nvmax = 19)
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] <-
         mean((Hitters$Salary[folds == j] - pred)^2)
   }
}

This has given us a \(10 \times 19\) matrix, of which the \((j,i)\)th element corresponds to the test MSE for the \(j\)th cross-validation fold for the best \(i\)-variable model. We use the apply() function to average over the columns of this matrix in order to obtain a vector for which the \(i\)th element is the cross-validation error for the \(i\)-variable model.

Code
(mean.cv.errors <- apply(cv.errors, 2, mean))
       1        2        3        4        5        6        7        8 
148937.9 125976.0 133478.0 143989.3 137791.8 124703.5 130471.8 121228.7 
       9       10       11       12       13       14       15       16 
123781.0 119869.0 120956.0 123365.1 123806.9 122523.4 122435.2 122442.8 
      17       18       19 
122400.9 122566.9 122602.5 
Code
par(mfrow = c(1, 1))

plot(sqrt(mean.cv.errors), type = "b", ylab="sqrt MSE")

We see that cross-validation selects a 10-variable model. We now perform best subset selection on the full data set in order to obtain the 10-variable model.

Code
reg.best <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
coef(reg.best, 10)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
        CRBI       CWalks    DivisionW      PutOuts      Assists 
   0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680 

12.3 The bootstrap principle

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical method. For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

We have an estimate \(\hat{\theta}\) for a parameter \(\theta\) and want to know how accurate our estimate is. For that we need to know the \(SE\) of the estimate or give a confidence interval for the parameter \(\theta\). The bootstrap can do this in quite general settings:

Example \(\theta\)= average height of all people in the US. It is unknown but estimated with the average height of 100 randomly selected people. We can’t compute the population mean because we can’t access the whole population. So we ’ plug in’ the sample in place of the population and compute the mean of the sample instead. The rationale for the plug-in principle is that the sample mean will be close to the population mean because the sample histogram is close to the population histogram.

The bootstrap uses the plug-in principle and the Monte Carlo Method to approximate quantities such as \(SE(\hat{ \theta})\) when we cannot get many samples.

The bootstrap pretends that the sample histogram is the population histogram and uses the Monte Carlo to simulate the quantity of interest.

What we do is we draw n samples with replacement from the original sample, each sample is the same size \(n\) as the original dataset, and with those samples we use the Monte Carlo principle to calculate the quantity of interest. Example of 4 samples chosen at random from a dataset with three points:

Code
x <- rnorm(3,mean=5,sd=2)
Y <- 0.2+0.33*x+ rnorm(1, -1.1, .8)

pop<- data.frame(obs=c("A","B","C"),x,Y)
pop
  obs        x         Y
1   A 3.076568 0.3463163
2   B 4.554017 0.8338746
3   C 4.059499 0.6706834
Code
#bootstrap samples:
set.seed(1)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
    obs        x         Y
1     A 3.076568 0.3463163
3     C 4.059499 0.6706834
1.1   A 3.076568 0.3463163
Code
set.seed(2)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
  obs        x         Y
1   A 3.076568 0.3463163
3   C 4.059499 0.6706834
2   B 4.554017 0.8338746
Code
set.seed(3)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
  obs        x         Y
1   A 3.076568 0.3463163
2   B 4.554017 0.8338746
3   C 4.059499 0.6706834
Code
set.seed(4)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
    obs        x         Y
3     C 4.059499 0.6706834
3.1   C 4.059499 0.6706834
3.2   C 4.059499 0.6706834

Drawing a bootstrap sample by sampling with replacement from the data is called nonparametric bootstrap.

If we know that the data follow a normal distribution, but we don’t know the mean on the standard deviation. The obvious thing to do there is to simply estimate the unknown mean and standard deviation, and then simply sample from the normal distribution, with that mean and standard deviation. That’s called parametric bootstrap.

This type of sampling works if the data are independent, that is, \(X_1\) to \(X_n\) are really generated independently. But oftentimes, there’s dependence in the data. For example, the data are observed over time.

If the sampling distribution of \(\hat{\theta}\) is approximately normal, then the confidence interval formula is :

\[ \left[ \hat{\theta} - z_{\alpha/2} SE(\hat{\theta}), \hat{\theta} + z_{\alpha/2} SE(\hat{\theta}) \right] \tag{14}\]

This formula represents a confidence interval for an estimator \(\hat{\theta}\), where:

  • \(\hat{\theta}\) is the point estimate.

  • \(z_{\alpha/2}\) is the critical value from the standard normal distribution for a given confidence level. For example for a confidence level of 95% \(\alpha =(1-0.95)=0.05\) ; \(\frac{\alpha}{2}=0.025\), \(z_{\frac{\alpha}{2}}=1.96\)

  • \(SE(\hat{\theta})\) is the standard error of the estimator.

If the distribution is not normal, we can still use the bootstrap by estimating that the distribution of the estimate can be approximated by the distribution of the bootstrap-copies extracted from it, so we can draw a distribution curve and calculate the 95% confidence interval using that curve:

\[ \left[ \hat{\theta}^*_{\alpha/2}, \hat{\theta}^*_{1-\alpha/2} \right] \tag{15}\]

where \(\hat{\theta}^*_{\alpha/2}\) is the \(\alpha/2\) percentile of the \(\hat{\theta}^*_{1},....\hat{\theta}^*_{B}\)

Estimating the accuracy of a linear regression Model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for \(\beta_0\) and \(\beta_1\) (intercept and slope) for a linear regression model that uses horsepower to predict mpg (miles per gallon) in the Auto data set.

We first create a simple function boot.fn() which takes in the Auto data set as well as the indices for the observations, and returns the intercept and slope estimates for the linear regression model.

Code
boot.fn<- function (data, index){
  coef(lm(mpg ~ horsepower, data= data, subset = index))
}

We then apply this function to the full set of 392 observations in order to compute get the slope and intersect for the full data set:

Code
boot.fn(Auto,1:392)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

if we use bootstrap principle we just need to create a sample with replacement of the same size of the full data set. This is one sample example:

Code
boot.fn(Auto, sample(392,392, replace=TRUE))
(Intercept)  horsepower 
 39.7916909  -0.1562981 

but what we do in r is to use the boot() function from the boot library to compute the standard errors of 1,000 bootstrap estimates for the intercept and the slope:

Code
set.seed(1)
boot(Auto, boot.fn,1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0553942585 0.843931305
t2* -0.1578447 -0.0006285291 0.007367396

This indicates that the bootstrap estimate for \(SE(\beta_0)\) is 0.84 and for \(SE(\beta_1)\) is 0.0073

Standard formulas can be used to compute the standard errors for the regression coefficients in a linear model. These can be obtained using the summary() function.

Code
summary(lm(mpg ~ horsepower, data= Auto))$coef
              Estimate  Std. Error   t value
(Intercept) 39.9358610 0.717498656  55.65984
horsepower  -0.1578447 0.006445501 -24.48914
                                                                                                                                                                                                       Pr(>|t|)
(Intercept) 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001220362
horsepower  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000070319890294050630351850583732442601103684864938259124755859375000000000000000000000000000000000000000000000000000

these standard errors are calculated using the formulas we saw in Assessing the Accuracy of the Coefficient Estimates and Confidence Intervals

The formula for the slope and for the intercept

These formulas rely on certain assumptions (out of the scope for this document, check the ISLR book for more info). The bootstrap approach does not rely on any of these assumptions, and so, it is likely giving more accurate estimate of the standard errors than the summary function

If the data is a time series, we can’t simply sample the observations with replacement because the data is not independent. There’s correlation in time series. For example we want to predict the value of a stock price using the previous days stock price. We expect the stock price for a given day to be correlated with the stock price from the previous day. This creates a problem for the bootstrap because bootstrap assumes the data are independent.

What people uses for time series is the block boostrap, it divides the data up into blocks and we assume that in between blocks, the data is independent.

Another instance where bootstrap is not a good approach is for estimating the prediction error in cross-validation problems. In cross-validation, each of the \(K\) validation folds is distinct from the other K-1 folds used for training, there is no overlap, this is crucial for its success. But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample. This will cause the bootstrap to seriously underestimate the true prediction error.

Estimating the Accuracy of a Statistic of Interest

We will be using the dataset Portfolio from LSLR package for this practice. Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction α of our money in X, and will invest the remaining 1 − α in Y . Since there is variability associated with the returns on these two assets, we wish to choose α to minimize the total risk, or variance, of our investment.In other words, we want to minimize Var(αX +(1−α)Y ). The value that minimizes the risk is given by the formula:

\[ \alpha= \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}} \]

where \(\sigma^2_X\)= VAR(X), \(\sigma^2_Y\)= VAR(Y) and \(\sigma_{XY}\) = COV(X,Y). These quantities are unknown, so we will estimate them using our data. We can write that function in r:

Code
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

alpha.fn(Portfolio, 1:nrow(Portfolio))
[1] 0.5758321

but what is the sampling variability of α? we use boostrap for this:

Code
set.seed(1)
boot.out<- boot::boot(Portfolio, alpha.fn, R = 1000)
boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = Portfolio, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5758321 -0.001596422  0.09376093
Code
plot(boot.out)

the statistic came as 0.5758 as before and we now know our standard error that came as 0.093

The plots show you the distribution of the samples and the QQ plot.

12.4 Use of Monte Carlo Simulation and Bootstrap

Simulations can also be used to check theoretical or analytical results. Also, many of the theoretical results we use in statistics are based on asymptotics: they hold when the sample size goes to infinity. In practice, we never have an infinite number of samples so we may want to know how well the theory works with our actual sample size. Sometimes we can answer this question analytically, but not always. Simulations are extremely useful in these cases.

As an example we are going to use simulation to compare the Central Limit Theorem to the t-distribution for different sample sizes. We will use our mice data and extract control and ‘fake’ treatment samples. The fake treatment samples are just samples labeled as treatment but extracted from the control population.

We will build a function that automatically generates a t-statistic under the null hypothesis for a sample size of n.

Code
dat <- read.csv("data/mice_pheno.csv")
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist

ttestgenerator <- function(n) {
  cases <- sample(controlPopulation,n)
  controls <- sample(controlPopulation,n)
  tstat <- (mean(cases)-mean(controls)) / 
      sqrt( var(cases)/n + var(controls)/n ) 
  return(tstat)
  }
ttests <- replicate(1000, ttestgenerator(10))
Code
hist(ttests)

Histogram of 1000 Monte Carlo simulated t-statistics.

we can use quantile-quantile plots to see how well this distribution approximates to the normal:

Code
qqnorm(ttests)
abline(0,1)

Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics to theoretical normal distribution.

This looks like a very good approximation. For this particular population, a sample size of 10 was large enough to use the CLT approximation. How about sample size 3?

Code
ttests <- replicate(1000, ttestgenerator(3))
qqnorm(ttests)
abline(0,1)

Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical normal distribution.

Now we see that the large quantiles, referred to by statisticians as the tails, are larger than expected (below the line on the left side of the plot and above the line on the right side of the plot). In the previous module, we explained that when the sample size is not large enough and the population values follow a normal distribution, then the t-distribution is a better approximation. Our simulation results seem to confirm this:

In the code below first we generate the percentiles: ps <- (seq(0,999)+0.5)/1000 seq(0,999) generates a sequence of numbers from 0 to 999. Adding 0.5 to each element centers the sequence values. Dividing by 1000 scales the values to be between 0 and 1. This results in 1000 equally spaced percentiles between 0.0005 and 0.9995. qt(ps, df=2*3-2) computes the quantiles of the t-distribution for the given percentiles ps with degrees of freedom df=2*3-2 qqplot creates a QQ plot comparing the quantiles of our t-test results to the theoretical quantiles of the t-distribution.

Code
ps <- (seq(0,999)+0.5)/1000
qqplot(qt(ps,df=2*3-2),ttests,xlim=c(-6,6),ylim=c(-6,6))
abline(0,1)

Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical t-distribution.

The QQ plot helps you visually assess how well your t-test results follow the theoretical t-distribution. The t-distribution is a much better approximation in this case, but it is still not perfect. This is due to the fact that the original data is not that well approximated by the normal distribution. We can see this plotting the qqplot for the full population data:

Code
qqnorm(controlPopulation)
qqline(controlPopulation)

Quantile-quantile of original data compared to theoretical quantile distribution.

13 Shrinkage Methods

Ridge regression and lasso.

The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficients estimates, or equivalently, that shrinks the coefficient estimates towards zero.Shrinking the coefficient estimates can significantly reduce their variance.

Ridge regression Also called \(L2\) Regularization. It addresses the problem of multicollinearity (when predictor variables are highly correlated). It is useful when you have many predictors and suspect multicollinearity. It works well when the number of predictors is larger than the number of observations.

\[ \sum^n_{i=1}\left( y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right) + \lambda\sum^p_{j=1}\beta^2_j \tag{16}\]

the first part of the function is RSS so we can write it like this: \[ RSS + \lambda\sum^p_{j=1}\beta^2_j \] We call \(\lambda\sum^p_{j=1}\beta^2_j\) the Ridge Penaly where \(\lambda\) is called the tuning parameter, to be determined separately.It controls the strength of the penalty. When \(\lambda\) is 0, it is just ordinary least square regression, as \(\lambda\) increases, the flexibility of the model decreases, helping to reduce overfitting.

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small, however, the second term of the function, called shrinkage penalty, is small when the coefficients are close to zero, and so it has the effect of shrinking the estimates towards zero.

One thing to take into account with this methods is that while in our regression formulas the units or scales of our variables do not matter, (if we measure the weight in grams or in kilograms) because the coefficient will adapt to that scale, but in this method that actually matters, and as a result is is important to standardize the predictors before applying ridge regression. To standardize he predictors we do as usual, we divide the feature by the standard deviation of that feature. \[ \tilde x_{ij} = \frac{x_{ij}}{SD(x_{ij})} =\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n}(x_{ij-\bar{x_j}})^2} \] Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all \(p\) predictors in the final model.

Lasso Also called \(L1\) regularization. The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The formula is very similar to the ridge regression but instead of taking the squares of the coefficients, now it is the sum of the absolute value: \[ RSS + \lambda\sum^p_{j=1}|\beta_j| \] The difference is that it actually sets the coefficients to 0 when lambda is large enough, so it subsets the selection removing the features that are not important.

Selecting the Tuning Parameter for Ridge Regression and Lasso

As in ridge regression, selecting a good value of lambda for the lasso is critical, we will use again cross validation for this. Lasso is particularly useful when you expect only a few of predictors to be relevant, leading to a simpler, more interpretable model. If most of the predictors are relevant, ridge regression will do better selecting the best model. We don’t know this when we create the model, so it is usual to use both methods, and use cross-validation to determine the best model coming out of each method by comparing the cross-validated error.

Cross validation provides a simple way to find the value of \(\lambda\) both for lasso and ridge regression. We choose a grid of lambda values, and compute the cross-validation error rate for each value of it. We then select the tuning parameter value for which the cross-validation error is smallest. Finally the model is re-fit using all of the available observations and the selected value of the tuning parameters.

Example: Ridge regression

We will use the glmnet package in order to perform ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ~ x syntax. We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Before proceeding ensure that the missing values have been removed from the data. The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables. The latter property is important because glmnet() can only take numerical, quantitative inputs.

Code
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ . ,data= Hitters)[, -1]
y <- Hitters$Salary

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.

By default the glmnet() function performs ridge regression for an automatically selected range of λ values. However, here we have chosen to implement the function over a grid of values ranging from \(λ=10^{10}\) to \(λ=10^{−2}\) , essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of λ that is not one of the original grid values. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.

Code
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)

Associated with each value of λ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20×100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).

Code
dim(coef(ridge.mod))
[1]  20 100
Code
ridge.mod$lambda[50] #the value of lambda at possition 50
[1] 11497.57
Code
coef(ridge.mod)[,50] #coefficients for that value of lambda
  (Intercept)         AtBat          Hits         HmRun          Runs 
407.356050200   0.036957182   0.138180344   0.524629976   0.230701523 
          RBI         Walks         Years        CAtBat         CHits 
  0.239841459   0.289618741   1.107702929   0.003131815   0.011653637 
       CHmRun         CRuns          CRBI        CWalks       LeagueN 
  0.087545670   0.023379882   0.024138320   0.025015421   0.085028114 
    DivisionW       PutOuts       Assists        Errors    NewLeagueN 
 -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531 
Code
plot(ridge.mod, xvar="lambda", label=TRUE)

This plot is showing how the coefficients shrink as we increase the value of lambda.

We can use predict() function for a number of purposes For instance we can obtain the ridge regression coefficients for a new value of lambda, say 50:

Code
predict(ridge.mod, s=50, type= 'coefficients')[1:20,]
   (Intercept)          AtBat           Hits          HmRun           Runs 
  48.766103292   -0.358099859    1.969359286   -1.278247981    1.145891632 
           RBI          Walks          Years         CAtBat          CHits 
   0.803829228    2.716185796   -6.218319217    0.005447837    0.106489514 
        CHmRun          CRuns           CRBI         CWalks        LeagueN 
   0.624485956    0.221498464    0.218691380   -0.150024549   45.925885514 
     DivisionW        PutOuts        Assists         Errors     NewLeagueN 
-118.201136816    0.250232154    0.121566461   -3.278599545   -9.496680310 

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and fit a ridge regression model on the training set and evaluate its MSE on the test set using lambda=4. Note the use of predict() this time we get predictions for a test set by replacing the type="coefficients" with the newx argument

Code
set.seed(1)
train<- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]

ridge.mod <- glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh= 1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx=x[test,])
(MSE<- mean((ridge.pred-y.test)^2))
[1] 142199.2

the test MSE = 142199.1507228

We now check whether there is any benefit to performing ridge regression with lambda=4 instead of just performing least squares regression (lambda=0). In order for glmnet() to yield exact least squares coefficients when lambda is 0, we use the argument exact=T

Code
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test, ],
    exact = T, x = x[train, ], y = y[train])
mean((ridge.pred - y.test)^2)
[1] 168588.6

In general, instead of arbitrarily choosing a value for lambda, it would be better to use cross-validation to choose the tuning parameter. We can do this using the built in cross-validation function.

The glmnet package have a function cv.glmnet() that will do the cross validation. By default the function performs ten-fold cross-validation, though this can be changed using the argument nfold.

Code
set.seed(1)
cv.ridge<- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.ridge)

WE can see here that with higher values of lambda the mean squares errors are too high (the restricted coefficients are too small to fit the model properly). The vertical lines shows the minimum on the left and the other vertical line is at one standard error from the minimum. The top x axis shows the number of variables in the model. In this case what the graph is showing is that the model without adjustment with the full coefficients for all the variables is doing quite well to fit the model.

The best value for lambda that results in the smallest cross-validation error is 326. What is the MSE associated with this value of lambda?

Code
bestlam <- cv.ridge$lambda.min
bestlam
[1] 326.0828
Code
ridge.pred <- predict(ridge.mod, s = bestlam,
    newx = x[test, ])
mean((ridge.pred - y.test)^2)
[1] 139856.6

This represents an improvement over lambda 0. Finally we refit our ridge regression model on the full data set, using the value of lambda chosen by cross-validation and examine the coefficient estimates.

Code
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20, ]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
 15.44383120   0.07715547   0.85911582   0.60103106   1.06369007   0.87936105 
       Walks        Years       CAtBat        CHits       CHmRun        CRuns 
  1.62444617   1.35254778   0.01134999   0.05746654   0.40680157   0.11456224 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
  0.12116504   0.05299202  22.09143197 -79.04032656   0.16619903   0.02941950 
      Errors   NewLeagueN 
 -1.36092945   9.12487765 

Example: lasso

We saw that ridge regression with a wise choice of λ can outperform least squares on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the glmnet() function; however, this time we use the argument alpha=1. Other than that change, we proceed just as we did in fitting a ridge model.

Code
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1,
    lambda = grid)
plot(lasso.mod, xvar="lambda", label= TRUE)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. The top x axes tells us how many variable coefficients are not 0 We now perform cross-validation and compute the associated test error.

Code
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)

Code
bestlam <- cv.out$lambda.min

#test error:
lasso.pred <- predict(lasso.mod, s = bestlam,
    newx = x[test, ])
mean((lasso.pred - y.test)^2)
[1] 143673.6

This is substantially lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of ridge regression with \(\lambda\) chosen by cross-validation.

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 8 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\lambda\) chosen by cross-validation contains only eleven variables.

We now run the model over the full dataset:

Code
out <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients",
    s = bestlam)[1:20, ]
lasso.coef[lasso.coef != 0]
  (Intercept)         AtBat          Hits         Walks         Years 
   1.27479059   -0.05497143    2.18034583    2.29192406   -0.33806109 
       CHmRun         CRuns          CRBI       LeagueN     DivisionW 
   0.02825013    0.21628385    0.41712537   20.28615023 -116.16755870 
      PutOuts        Errors 
   0.23752385   -0.85629148 

14 Dimension Reduction Methods

To recapitulate, we have seen subset selection methods, where we were just taking a subset of the predictors and using least squares models choosing the best model. Then ridge regression and lasso, where we were taking all of the predictors, but it does not use least squares, but a shrinkage approach to fit the model. Now we are going to use least squares but not on the original predictors, instead we are going to come up with new predictors, which are linear combinations of the original ones.

Let \(Z_1,Z_2,\dots,Z_M\) represent linear combination of some of the \(p\) original predictors and use them to fit a linear regression model:

\[ y_i=\theta_0 + \sum_{m=1}^M \theta_mz_{im}+\epsilon_i \tag{17}\]

Principal Components Regression (PCR)

The first principal component is that (normalized) linear combination of the variables with the largest variance. The second principal component has largest variance, subject to being uncorrelated with the first etc.

Partial Least Squares (PLS) Like PCR, PLS is a dimension reduction method, which first identifies a new set of features that are linear combinations of the original features, and then fits a linear model. After standardizing the \(p\) predictors, PLS computes the first direction \(Z_1\) by setting each \(\phi_{ij}\) equal to the coefficient from the simple linear regression of Y onto \(X_j\), hence, PLS places the highest weight on the variables that are most strongly related to the response. Subsequent directions (Z) are found by taking residuals and then repeating the above process.

15 Multiple hypothesis testing

In the Common Errors in significance testing we saw that rejecting \(H_0\) if the \(p\)-value is below a certain \(\alpha\) threshold is a simple way to control Type I error. But now suppose that we wish to test \(m\) null hypothesis. If we reject all null hypothesis for which the \(p\)-value falls bellow our \(\alpha\), then how many Type I errors should we expect to make?

Suppose that we flip 1,024 fair coins ten times each. Then we would expect (on average) one coin to come up all tails. (There’s a \(\frac{1}{2}^{10} = \frac{1}{1,024}\) chance that any single coin will come up all tails. So if we flip 1,024 coins, then we expect one coin to come up all tails, on average). If one of our coins comes up all tails, then we might therefore conclude that this particular coin is not fair. In fact, a standard hypothesis test for the null hypothesis that this particular coin is fair would lead to a p-value below \(0.002^{12}\) But it would be incorrect to conclude that the coin is not fair: in fact, the null hypothesis holds, and we just happen to have gotten ten tails in a row by chance.

Code
# Load necessary libraries
library(tidyverse)

# Set seed for reproducibility
set.seed(1)

# Function to simulate coin flips
flip_coins <- function(num_coins, num_flips) {
  results <- matrix(sample(c("H", "T"), num_coins * num_flips, replace = TRUE), ncol = num_flips)
  apply(results, 1, function(coin) all(coin == "T"))
}

# Parameters
num_coins <- 1024
num_flips <- 10

# Simulate the coin flips
all_tails <- flip_coins(num_coins, num_flips)

# Count how many coins came up all tails
num_all_tails <- sum(all_tails)

# Probability of getting all tails in 10 flips of a fair coin
prob_all_tails <- 0.5^10

# Output results
cat("Number of coins that came up all tails:", num_all_tails, "\n")
Number of coins that came up all tails: 1 
Code
# Visualize the distribution of results
results_df <- data.frame(
  Coin = 1:num_coins,
  All_Tails = all_tails
)

ggplot(results_df, aes(x = Coin, y = as.numeric(All_Tails))) +
  geom_point() +
  labs(title = "Distribution of Coins Coming Up All Tails",
       x = "Coin Number",
       y = "All Tails (1 = Yes, 0 = No)") +
  theme_minimal()

When testing a huge number of null hypotheses, we are bound to get some very small p-values by chance. If we make a decision about whether to reject each null hypothesis without accounting for the fact that we have performed a very large number of tests, then we may end up rejecting a great number of true null hypotheses.

If we choose to reject the null hypothesis when the \(p\)-value is less than 0.01, then there is a 1% chance of making a false rejection. So if we have \(m\) number of hypotheses we want to test, our probability of falsely rejecting the null hypothesis will be \(m\times 0.01\) and that will be our number of Type I errors. How can we control fot this situation?

15.1 Family-Wise Error Rate (FWER)

The family-wise rate is the probability of making at least one Type I error when conducting hypothesis tests.

When performing the test, we know how many hypotheses we have tested \(m\), we also know how many \(H_0\) we have rejected, but we don’t know if we rejected them correctly or not, but we can approximate that. \[ FWER = 1 - (1 - \alpha)^m \]

If the null hypothesis is true for each of \(m\) independent hypothesis tests, then the FWER is equal to \(1-(1-\alpha)^m\). We can use this expression to compute the FWER for \(m=1,\ldots, 500\) and \(\alpha=0.05\), \(0.01\), and \(0.001\).

Code
m <- 1:500
fwe1 <- 1 - (1 - 0.05)^m
fwe2 <- 1 - (1 - 0.01)^m
fwe3 <- 1 - (1 - 0.001)^m

We plot these three vectors. The red, blue, and green lines correspond to \(\alpha=0.05\), \(0.01\), and \(0.001\), respectively.

Code
par(mfrow = c(1, 1))
plot(m, fwe1, type = "l", log = "x", ylim = c(0, 1), col = 2,
    ylab = "Family - Wise Error Rate",
    xlab = "Number of Hypotheses")
lines(m, fwe2, col = 4)
lines(m, fwe3, col = 3)
abline(h = 0.05, lty = 2)

Even for moderate values of \(m\) such as \(50\), the FWER exceeds \(0.05\) unless \(\alpha\) is set to a very low value, such as \(0.001\). Of course, the problem with setting \(\alpha\) to such a low value is that we are likely to make a number of Type II errors: in other words, our power is very low.

15.2 Bonferroni Method for Controlling FWER

The Bonferroni method adjusts the significance level (𝛼) for each individual hypothesis test to ensure that the overall probability of making at least one Type I error (false positive) remains controlled. Given a number of hypotheses being tested \(m\), we adjust the significance level for each test \(\alpha_{adj}\) by dividing the desired significance level \(\alpha\) byt the number of hypotheses \(m\)

\[ \alpha_{adj} = \frac{\alpha}{m} \] Now we compare the p-values of each test against the adjusted significance level and reject those null hypothesis for any test with a \(p\)-value less than or equal to \(\alpha_{adj}\)

We now consider the Fund dataset. It contains data for 50 months for a list of managers and their excess return, this is, the percentage points difference between the funds managed by these people compared with the stock market in general for the same period. In the table below we show the stats for the first 5 managers in our list.

If we control the Type I error at level α = 0.05 for each fund manager separately, then we will conclude that the first and third managers have significantly non-zero excess returns; in other words, we will reject H01 : μ1 = 0 and H03 : μ3 = 0. However, as discussed in previous sections, this procedure does not account for the fact that we have tested multiple hypotheses, and therefore it will lead to a FWER greater than 0.05. If we instead wish to control the FWER at level 0.05, then, using a Bonferroni correction, we must control the Type I error for each individual manager at level α/m = 0.05/5 = 0.01. Consequently, we will reject the null hypothesis only for the first manager, since the p-values for all other managers exceed 0.01.

We are going to do the calculation manually here, and later in the exercises we will use an R function that calculate this for us.

Code
fund <- ISLR2::Fund
managers <- numeric(0)
pvals <- numeric(0)
tstat <- numeric(0)
means <- numeric(0)
sds <- numeric(0)
for (i in 1:5){
  testResult <- t.test(fund[,i], mu = 0) 
managers <- c(managers, i)
pvals<- c(pvals, round(testResult$p.value,3))
tstat<- c(tstat, round(testResult$statistic,2))  
means<- c(means, round(mean(fund[,i]),1))
sds <- c(sds, round(sd(fund[,i]),1))
}
df<- data.frame(manager = managers, mean = means, StandDev = sds,  tStatistic = tstat,pValues = pvals)
df
  manager mean StandDev tStatistic pValues
1       1  3.0      7.4       2.86   0.006
2       2 -0.1      6.9      -0.10   0.918
3       3  2.8      7.5       2.62   0.012
4       4  0.5      6.7       0.53   0.601
5       5  0.3      6.8       0.31   0.756

The Bonferroni correction is by far the best-known and most commonly used multiplicity correction in all of statistics. Its ubiquity is due in large part to the fact that it is very easy to understand and simple to implement, and also from the fact that it successfully controls Type I error regardless of whether the m hypothesis tests are independent. However, as we will see, it is typically neither the most powerful nor the best approach for multiple testing correction. In particular, the Bonferroni correction can be quite conservative, in the sense that the true FWER is often quite a bit lower than the nominal (or target) FWER.

15.3 Holm’s Method for controlling FWER

For this method we start like with the Bonferroni, we just do our test and calculate our \(p\)-values as per normal, and then we sort those \(p\)-values from smallest to greatest, and one by one we calculate L that validates this formula

\[ L = \min \left\{ j : p_{j} > \frac{\alpha}{m+1-j} \right\} \] where \(j\) is the index of the hypotheses tested, so \(j= 1,2,\dots,m\)

Once we have L, we reject all null hypothesis that are below L.

Code
df <- df %>% arrange(df$pValues)
alpha <- 0.05
m = 5
j=1

for (i in df$pValues){
  df$reject = df$pValues < (alpha/(m+1-j))
  j=j+1
  
}
df
  manager mean StandDev tStatistic pValues reject
1       1  3.0      7.4       2.86   0.006   TRUE
2       3  2.8      7.5       2.62   0.012   TRUE
3       4  0.5      6.7       0.53   0.601  FALSE
4       5  0.3      6.8       0.31   0.756  FALSE
5       2 -0.1      6.9      -0.10   0.918  FALSE

We see that in this setting, we reject the null hypothesis for managers 1 and 3 using Holm’s method.

Holm’s method will reject at least as many hypothesis than the Bonferroni method, but usually rejects more.

Practice: Type I errors, FWER, Bonferrony, Holm’s

We are going to simulate some random data to test 100 hypotheses to check if their mean is equal to 0. Each dataset contains 10 observations.

We create the data and then modify the first 50 columns so its mean is not 0 but 0.5, so we would expect to reject the first 50 \(H_0\) hypotheses and do not reject the last 50 \(H_0\) hypotheses.

Code
set.seed(6)
X<- matrix(rnorm(10*100), 10,100)
X[,1:50]<- X[,1:50] +0.5

p.values = rep(0,100)
for (i in 1:100)
  p.values[i] <- t.test(X[,i],mu=0)$p.value
decision <- rep("Do not reject H0", 100)
decision[p.values <= .05]<- 'reject H0'
table(decision,
  c(rep("H0 is False", 50), rep("H0 is true",50)))
                  
decision           H0 is False H0 is true
  Do not reject H0          40         47
  reject H0                 10          3

The situation in which we rejected the null hypothesis when it is in fact true are our Type I errors, in our case 3.

With only 10 observations, we could only reject 10 out of the 50 false null hypotheses, specially because our mean was only 0.5 difference. We can see how if we change the mean of the first 50 hypotheses to 1 instead of 0.5, we have a better success on the number of Type II errors:

Code
X<- matrix(rnorm(10*100), 10,100)
X[,1:50]<- X[,1:50] +1

p.values = rep(0,100)
for (i in 1:100)
  p.values[i] <- t.test(X[,i],mu=0)$p.value
decision <- rep("Do not reject H0", 100)
decision[p.values <= .05]<- 'reject H0'
table(decision,
  c(rep("H0 is False", 50), rep("H0 is true",50)))
                  
decision           H0 is False H0 is true
  Do not reject H0           9         49
  reject H0                 41          1

As we have done in the lecture above, we now conduct a one-sample \(t\)-test for each of the first five managers in the Fund dataset, in order to test the null hypothesis that the \(j\)th fund manager’s mean return equals zero, \(H_{0j}: \mu_j=0\).

Code
fund.mini <- Fund[, 1:5]
t.test(fund.mini[, 1], mu = 0)

    One Sample t-test

data:  fund.mini[, 1]
t = 2.8604, df = 49, p-value = 0.006202
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.8923397 5.1076603
sample estimates:
mean of x 
        3 
Code
fund.pvalue <- rep(0, 5)
for (i in 1:5)
  fund.pvalue[i] <- t.test(fund.mini[, i], mu = 0)$p.value
fund.pvalue
[1] 0.006202355 0.918271152 0.011600983 0.600539601 0.755781508

The \(p\)-values are low for Managers One and Three, and high for the other three managers. However, we cannot simply reject \(H_{01}\) and \(H_{03}\), since this would fail to account for the multiple testing that we have performed. Instead, we will conduct Bonferroni’s method and Holm’s method to control the FWER.

Bonferroni

To do this, we use the p.adjust() function. Given the \(p\)-values, the function outputs adjusted \(p\)-values, which can be thought of as a new set of \(p\)-values that have been corrected for multiple testing. If the adjusted \(p\)-value for a given hypothesis is less than or equal to \(\alpha\), then that hypothesis can be rejected while maintaining a FWER of no more than \(\alpha\). In other words, the adjusted \(p\)-values resulting from the p.adjust() function can simply be compared to the desired FWER in order to determine whether or not to reject each hypothesis.

For example, in the case of Bonferroni’s method, the raw \(p\)-values are multiplied by the total number of hypotheses, \(m\), in order to obtain the adjusted \(p\)-values. (However, adjusted \(p\)-values are not allowed to exceed \(1\).)

Code
p.adjust(fund.pvalue, method = "bonferroni")
[1] 0.03101178 1.00000000 0.05800491 1.00000000 1.00000000
Code
pmin(fund.pvalue * 5, 1)
[1] 0.03101178 1.00000000 0.05800491 1.00000000 1.00000000

Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager One while controlling the FWER at \(0.05\).

Holm’s method

By contrast, using Holm’s method, the adjusted \(p\)-values indicate that we can reject the null hypotheses for Managers One and Three at a FWER of \(0.05\).

Code
p.adjust(fund.pvalue, method = "holm")
[1] 0.03101178 1.00000000 0.04640393 1.00000000 1.00000000

As discussed previously, Manager One seems to perform particularly well, whereas Manager Two has poor performance.

Code
apply(fund.mini, 2, mean)
Manager1 Manager2 Manager3 Manager4 Manager5 
     3.0     -0.1      2.8      0.5      0.3 

Tukey’s Honest Significant Difference (HSD) test

Is there evidence of a meaningful difference in performance between these two managers? Performing a paired \(t\)-test using the t.test() function results in a \(p\)-value of \(0.038\), suggesting a statistically significant difference.

Code
t.test(fund.mini[, 1], fund.mini[, 2], paired = T)

    Paired t-test

data:  fund.mini[, 1] and fund.mini[, 2]
t = 2.128, df = 49, p-value = 0.03839
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.1725378 6.0274622
sample estimates:
mean difference 
            3.1 

However, we decided to perform this test only after examining the data and noting that Managers One and Two had the highest and lowest mean performances.

In a sense, this means that we have implicitly performed \({5 \choose 2} = 5(5-1)/2=10\) hypothesis tests, rather than just one. Hence, we use the TukeyHSD() function to apply Tukey’s method in order to adjust for multiple testing. This function takes as input the output of an ANOVA regression model, which is essentially just a linear regression in which all of the predictors are qualitative. In this case, the response consists of the monthly excess returns achieved by each manager, and the predictor indicates the manager to which each return corresponds.

Code
returns <- as.vector(as.matrix(fund.mini))
manager <- rep(c("1", "2", "3", "4", "5"), rep(50, 5))
a1 <- aov(returns ~ manager)
TukeyHSD(x = a1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = returns ~ manager)

$manager
    diff        lwr       upr     p adj
2-1 -3.1 -6.9865435 0.7865435 0.1861585
3-1 -0.2 -4.0865435 3.6865435 0.9999095
4-1 -2.5 -6.3865435 1.3865435 0.3948292
5-1 -2.7 -6.5865435 1.1865435 0.3151702
3-2  2.9 -0.9865435 6.7865435 0.2452611
4-2  0.6 -3.2865435 4.4865435 0.9932010
5-2  0.4 -3.4865435 4.2865435 0.9985924
4-3 -2.3 -6.1865435 1.5865435 0.4819994
5-3 -2.5 -6.3865435 1.3865435 0.3948292
5-4 -0.2 -4.0865435 3.6865435 0.9999095

The TukeyHSD() function provides confidence intervals for the difference between each pair of managers (lwr and upr), as well as a \(p\)-value. All of these quantities have been adjusted for multiple testing. Notice that the \(p\)-value for the difference between Managers One and Two has increased from \(0.038\) to \(0.186\), so there is no longer clear evidence of a difference between the managers’ performances. We can plot the confidence intervals for the pairwise comparisons using the plot() function.

Code
plot(TukeyHSD(x = a1))

15.4 Benjamini-Hochberg Procedure to control

The False Discovery Rate When \(m\) is large, then trying to prevent any false positives (as in FWER control) is simply too stringent. Instead, we might try to make sure that the ratio of false positives to total positives (is sufficiently low, so that most of the rejected null hypotheses are not false positives. The ratio is known as the false discovery proportion (FDP).

It might be tempting to ask the data analyst to control the FDP: to make sure that no more than, say, 20% of the rejected null hypotheses are false positives. However, in practice, controlling the FDP is an impossible task for the data analyst, since she has no way to be certain, on any particular dataset, which hypotheses are true and which are false, Therefore, we instead control the false discovery rate (FDR) defined as FDR = E(FDP) = E(False Rejections/Total Rejections) When we control the FDR at (say) level q = 20%, we are rejecting as many null hypotheses as possible while guaranteeing that no more than 20% of those rejected null hypotheses are false positives, on average. The expectation is taken over the population from which the data are generated. For instance, suppose we control the FDR for \(m\) null hypotheses at \(q = 0.2\). This means that if we repeat this experiment a huge number of times, and each time control the FDR at \(q = 0.2\), then we should expect that, on average, 20% of the rejected null hypotheses will be false positives. On a given dataset, the fraction of false positives among the rejected hypotheses may be greater than or less than 20%.

It is worth noting that unlike \(p\)-values, for which a threshold of 0.05 is typically viewed as the minimum standard of evidence for a “positive” result, and a threshold of 0.01 or even 0.001 is viewed as much more compelling, there is no standard accepted threshold for FDR control. Instead, the choice of FDR threshold is typically context-dependent, or even dataset dependent.

This method is useful when we have a large number of hypothesis to test and we are ok with rejecting a percentage of true null Hypotheses.

We need some way to connect the \(p\)-values, \(p1,\dots,pm\), from the \(m\) null hypotheses to the desired FDR value, \(q\). It turns out that a very simple procedure can be used to control the FDR

  • First we specify \(q\) that is the level at which we wish to control FDR

  • Compute the \(p\)-values for each of our hypotheses - Order the \(p\)-values from smallest to greatest - Define L where \[ L= max \{ j:p_{(j)} < qj/m\} \]

  • Reject all null hypotheses \(H_{0j}\) for which \(p_j \leq p_{(L)}\)

We are going to use now the full dataset of Fund with 2000 managers and apply the Bonferroni method and the Benjamini-Hochberg method. First, we are going to do the calculation manually and later we will use a function.

Code
fund <- ISLR2::Fund
managers <- numeric(0)
pvals <- numeric(0)
bonfRej <- numeric(0)
alpha = 0.05
m= length(fund)

#bonferroni rejections
for (i in 1:m){
  testResult <- t.test(fund[,i], mu = 0) 
managers <- c(managers, i)
pvals<- c(pvals, round(testResult$p.value,3))
bonf <- testResult$p.value <= (alpha/m)
bonfRej <- c(bonfRej, bonf)
}
df<- data.frame(manager = managers, pValues = pvals, bonfRej = bonfRej)


#Benjamini-Hochberg rejections

ps <- sort(df$pValues)
q <- 0.1
wh.ps <- which(ps < q * (1:m) / m)
if (length(wh.ps) >0) {
  wh <- 1:max(wh.ps)
 } else {
  wh <- numeric(0)
 }


plot(ps, log = "xy", ylim = c(4e-6, 1), ylab = "P-Value",
    xlab = "Index", main = "")
points(wh, ps[wh], col = 4)
abline(a = 0, b = (q / m), col = 2, untf = TRUE)
abline(h = 0.1 / 2000, col = 3)

Code
sum(df$bonfRej)
[1] 0
Code
max(wh)
[1] 147

The graph shows in blue the rejected hypotheses using BH

Due to the large number of \(m\), the Bonferroni method does not reject any hypothesis, while Benjamin-Hochberg rejects 147, but we have to remember that we expect a percentage of them will be false rejections at our level of \(q\).

Now we will calculate the Benjamini-Hochberg using p.adjust() function in R. We will show the first 10

Code
q.values.BH <- p.adjust(df$pValues, method = "BH")
q.values.BH[1:10]
 [1] 0.08633094 0.99189627 0.12244898 0.92319508 0.95635674 0.06521739
 [7] 0.07575758 0.06521739 0.06521739 0.06521739

The q-values output by the Benjamini-Hochberg procedure can be interpreted as the smallest FDR threshold at which we would reject a particular null hypothesis. For instance, a \(q\)-value of \(0.1\) indicates that we can reject the corresponding null hypothesis at an FDR of \(10\%\) or greater, but that we cannot reject the null hypothesis at an FDR below \(10\%\).

If we control the FDR at \(10\%\), then for how many of the fund managers can we reject \(H_{0j}: \mu_j=0\)?

Code
sum(q.values.BH <= .1)
[1] 147

We find that 147 of the 2,000 fund managers have a \(q\)-value below 0.1; therefore, we are able to conclude that 147 of the fund managers beat the market at an FDR of \(10\%\). Only about 15 (\(10\%\) of 147) of these fund managers are likely to be false discoveries.

15.5 Resampling approaches

So far, we have assumed that we want to test some null hypothesis with some statistic \(T\) and that we know, or can assume, the distribution of \(T\) under \(H_0\), this allows us to compute the \(p\)-value, but what if this theoretical null distribution is unknown?

We are only going to see this approach for a two-sample \(t\)-Test, and this method cannot be automatically used for other types of tests.

We have two samples \(X\) and \(Y\) and the \(H_0\) is that the means of thos two samples is the same. As we have seen in the central limit theorem if \(N\) is big enough, we know that \(T\) approximately follows a normal distribution under \(H_0\), but if \(N\) is small, then we don’t know the theoretical null distribution of \(T\), so we can can use re sampling and let the computer find out the right distribution.

  • Compute the two-sample \(t\)-statistic \(T\) on the original data \(X\) and \(Y\).
  • choose a large number (like 1000 or higher) \(B\)
  • Randomly shuffle the \(X\) and \(Y\) observations and select two samples \(x^*\) and \(y^*\)
  • Compute a two sample \(t\)-statistic on the shuffled data \(T^*\)
  • repeat the two last steps \(B\) times.
  • the \(p\)-value is given by the fraction of the times that the \(t\)-statistic on the shuffled data exceeded the absolute value, the \(t\)-statistic on the real data.

\[ p=\frac{ \sum^B_{b=1}1_{(|T^{*b}| \geq |T|)}}{B} \] - now we can compare this calculated \(p\)-value with that one calculated at the beginning on the original two sample test.

The reason why we work with absolute values for the \(T\) is because in this particular case it is a two-sided \(t\)-test so we are interested in whether \(X\) is larger than \(Y\) or vice-versa but if it’s a one side \(t\)-test then we can get rid of the absolute values.

Practice

Here, we implement the re-sampling approach to hypothesis testing using the Khan dataset. We merge together train and test datasets because we are not using that here and we do the same for the response \(y\) that gives us the type of cancer.

First, we merge the training and testing data, which results in observations (rows) on \(83\) patients for \(2{,}308\) columns. Each column represent a gene. So we are going to perform \(2{,}308\) hypothesis test here.

Code
attach(Khan)
x <- rbind(xtrain, xtest)
y <- c(as.numeric(ytrain), as.numeric(ytest))
dim(x)
[1]   83 2308
Code
table(y)
y
 1  2  3  4 
11 29 18 25 

There are four classes of cancer.To simplify this example we are only going to compare two of the 4 classes only. For each gene, we compare the mean expression in the second class (rhabdomyosarcoma) to the mean expression in the fourth class (Burkitt’s lymphoma). We see in the table above that we have 29 observations for class 2 and 25 for class 4.

Performing a standard two-sample \(t\)-test on the \(11\)th gene produces a test-statistic of \(-2.09\) and an associated \(p\)-value of \(0.0412\), suggesting modest evidence of a difference in mean expression levels between the two cancer types.

Code
x <- as.matrix(x)
x1 <- x[which(y == 2), ]
x2 <- x[which(y == 4), ]
n1 <- nrow(x1)
n2 <- nrow(x2)
t.out <- t.test(x1[, 11], x2[, 11], var.equal = TRUE)
TT <- t.out$statistic
TT
        t 
-2.093633 
Code
t.out$p.value
[1] 0.04118644

However, this \(p\)-value relies on the assumption that under the null hypothesis of no difference between the two groups, the test statistic follows a \(t\)-distribution with \(29+25-2=52\) degrees of freedom. Instead of using this theoretical null distribution, we can randomly split the 54 patients into two groups of 29 and 25, and compute a new test statistic. Under the null hypothesis of no difference between the groups, this new test statistic should have the same distribution as our original one. Repeating this process 10,000 times allows us to approximate the null distribution of the test statistic. We compute the fraction of the time that our observed test statistic exceeds the test statistics obtained via re-sampling.

Code
set.seed(1)
B <- 10000
Tbs <- rep(NA, B) #initializing an empty vector for the T statistic
for (b in 1:B) {
   dat <- sample(c(x1[, 11], x2[, 11])) #reshufle the 11th gene for the two types of cancer
   Tbs[b] <- t.test(dat[1:n1], 
                    dat[(n1 + 1):(n1 + n2)],
        var.equal = TRUE
      )$statistic
}
mean((abs(Tbs) >= abs(TT))) #calculate how many T's are higher than the original
[1] 0.0416

This fraction, \(0.0416\), is our re-sampling-based \(p\)-value. It is almost identical to the \(p\)-value of \(0.0412\) obtained using the theoretical null distribution. This means that the assumptions we made in the original test hold for this data

We can plot a histogram of the re-sampling-based test statistics.

Code
hist(Tbs, breaks = 100, xlim = c(-4.2, 4.2), main = "",
    xlab = "Null Distribution of Test Statistic", col = 7)

lines(seq(-4.2, 4.2, len = 1000),
      dt(seq(-4.2, 4.2, len = 1000),
      df = (n1 + n2 - 2)
      ) * 1000, col = 2, lwd = 3)
abline(v = TT, col = 4, lwd = 2)
text(TT + 0.5, 350, paste("T = ", round(TT, 4), sep = ""),
    col = 4)

In yellow we have the resampling null distribution and in red the theoretical \(t\) distribution for mean 0, sd 1 and 52 degrees of freedom.

The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red, so it is to no surprise that the theoretical \(p\)-value from the test is almost the same as the resampling \(p\)-value. This just confirms that it is reasonable to use the \(t\)-distribution here.

Finally, we implement the plug-in re-sampling FDR approach outlined. Depending on the speed of your computer, calculating the FDR for all 2,308 genes in the Khan dataset may take a while. Hence, we will illustrate the approach on a random subset of 100 genes. For each gene, we first compute the observed test statistic, and then produce \(10{,}000\) re-sampled test statistics. This may take a few minutes to run. If you are in a rush, then you could set B equal to a smaller value (e.g. B = 500).

Code
m <- 100
B <- 1000
set.seed(1)
index <- sample(ncol(x1), m)
Ts <- rep(NA, m)
Ts.star <- matrix(NA, ncol = m, nrow = B)
for (j in 1:m) {
  k <- index[j]
  Ts[j] <- t.test(x1[, k], x2[, k],
        var.equal = TRUE
      )$statistic
  for (b in 1:B) {
    dat <- sample(c(x1[, k], x2[, k]))
    Ts.star[b, j] <- t.test(dat[1:n1],
                           dat[(n1 + 1):(n1 + n2)], 
                           var.equal = TRUE
       )$statistic
  }
}

Next, we compute the number of rejected null hypotheses \(R\), the estimated number of false positives \(\widehat{V}\), and the estimated FDR, for a range of threshold values \(c\). The threshold values are chosen using the absolute values of the test statistics from the \(100\) genes.

Code
cs <- sort(abs(Ts))
FDRs <- Rs <- Vs <- rep(NA, m)
for (j in 1:m) {
  R <- sum(abs(Ts) >= cs[j])
  V <- sum(abs(Ts.star) >= cs[j]) / B
  Rs[j] <- R
  Vs[j] <- V
  FDRs[j] <- V / R
}

Now, for any given FDR, we can find the genes that will be rejected. For example, with the FDR controlled at 0.1, we reject 15 of the 100 null hypotheses. On average, we would expect about one or two of these genes (i.e. \(10\%\) of 15) to be false discoveries.

At an FDR of \(0.2\), we can reject the null hypothesis for \(28\) genes, of which we expect around six to be false discoveries. The variable index is needed here since we restricted our analysis to just \(100\) randomly-selected genes.

Code
max(Rs[FDRs <= .1])
[1] 15
Code
sort(index[abs(Ts) >= min(cs[FDRs < .1])])
 [1]   29  465  501  554  573  729  733 1301 1317 1640 1646 1706 1799 1942 2159
Code
max(Rs[FDRs <= .2])
[1] 28
Code
sort(index[abs(Ts) >= min(cs[FDRs < .2])])
 [1]   29   40  287  361  369  465  501  554  573  679  729  733  990 1069 1073
[16] 1301 1317 1414 1639 1640 1646 1706 1799 1826 1942 1974 2087 2159
Code
plot(Rs, FDRs, xlab = "Number of Rejections", type = "l",
    ylab = "False Discovery Rate", col = 4, lwd = 3)

As noted in the chapter, much more efficient implementations of the re-sampling approach to FDR calculation are available, using e.g. the samr package in R.

Source Code
---
title: "Assessing Model Accuracy"
format: html
editor: visual
---

```{r}
#| echo: false
library(dplyr)
library(tidyverse)
#library(here)
library(readxl)
library(easystats)
library(infer)
library(kableExtra)
library(plotly)
library(ggplot2)
#library(patchwork)
library(BSDA) 
library(MASS)
library(rafalib)
library(UsingR) #datasets
library(ISLR2) #datasets
library(scatterplot3d)
library(gridExtra)
library(caret) #confusionMatrix
library(pROC)
library(class)
library(boot) #crossvalidation
library(leaps) #best subset selection
library(glmnet) #ridge regression and lasso
#library(survival) #survival 
#library(survminer) #survival ggplots
#library(splines) #splines 
theme_set(theme_minimal())
options(scipen= 999)
```

In this section, we discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set.

# Measuring the Quality of Fit. Mean Squared Error (MSE) {#MeanSquaredError}

In order to evaluate the performance of a statistical method on a given dataset, we need some way to measure how well its predictions actually match the observed data. In regression, the most commonly-used measure is the *mean squared error (MSE)*

$$
MSE = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{f}(x_i))^2 =\frac{1}{n}\sum^n_{i=1}(y_i-\hat{y_i})^2
$$ {#eq-MSEMeanSquaredError}

that is the squared difference between the observed values and the predicted values divided by the number of data points.

If we remember the formula for the [RSS](/linearModels.html#eq-RSSsimpleRegression) we see that MSE formula can be written as:

$$
MSE = \frac{1}{n} RSS
$$

The MSE is computed using the training data that was used to fit the model, and should be referred as training MSE, but we are more interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Roughly speaking, the problem is that many statistical methods specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be small, but the test MSE is often much larger.

```{r, fig.align='center', echo=FALSE}

set.seed(123)
X <- seq(0, 10, length.out = 30)
Y <- 2 + 5 * X + X^2 + rnorm(30, sd = 5)
data <- data.frame(X, Y)

# Fit models
linear_model <- lm(Y ~ X, data = data)
parametric_model <- lm(Y ~ poly(X, 5), data = data)  # Increased polynomial degree
overfitted_model <- lm(Y ~ poly(X, 20), data = data)

# Predict values
data <- data %>%
  mutate(
    linear_pred = predict(linear_model),
    polynomial_pred = predict(parametric_model),
    overfitted_pred = predict(overfitted_model)
  )

# Plot
ggplot(data, aes(x = X, y = Y)) +
  geom_point() +
  geom_line(aes(y = linear_pred), color = "blue") +
  geom_line(aes(y = polynomial_pred), color = "red") +
  geom_line(aes(y = overfitted_pred), color = "green") +
  labs(title = "Scatter Plot with 3 different Regression Lines", x = "X", y = "Y")
```

the graph above shows a linear regression line, and two polynomial lines one of them over fitted. If we calculate the Mean Square Error of each regression we can see that the over-fit model is not the best predictor over new data:

```{r, fig.align='center'}
# Function to calculate MSE
calculate_mse <- function(model, new_data) {
  mean((new_data$Y - predict(model, new_data))^2)
}

# Generate new training data
set.seed(456)
X <- seq(0, 10, length.out = 30)
Y <- 2 + 5 * X + X^2 + rnorm(30, sd = 5)
new_data <- data.frame(X, Y)

# Calculate MSE for each model
mse_linear <- calculate_mse(linear_model, new_data)
mse_parametric <- calculate_mse(parametric_model, new_data)
mse_overfitted <- calculate_mse(overfitted_model, new_data)

# Create a dataframe for MSE values
mse_data <- data.frame(
  Model = c("Linear", "Polynomial", "Overfitted"),
  MSE = c(mse_linear, mse_parametric, mse_overfitted)
)

# Plot MSE values
ggplot(mse_data, aes(x = Model, y = MSE, fill = Model)) +
  geom_bar(stat = "identity") +
  labs(title = "MSE for Different Models on New Training Data", x = "Model", y = "Mean Squared Error") +
  theme_minimal()

```

# Bias-Variance Trade-off {#bias-variance-trade-off}

Imagine you're trying to build a model to predict house prices. You could use a simple model or a complex one. The bias-variance trade-off is all about balancing simplicity and complexity.

Bias is error introduced by assuming a model is too simple, capturing only a rough approximation of the real relationship. High bias means the model might miss important patterns (underfitting). Think of a straight line when a curve is more appropriate.

Variance is error introduced by the model being too complex, capturing noise instead of the actual pattern. High variance means the model is too sensitive to small fluctuations in the training data, leading to poor performance on new data (overfitting). Imagine fitting every little bump in the data.

The trade-off part comes in because as you try to decrease bias by making the model more complex, you often increase variance, and vice versa. The goal is to find the sweet spot where the model captures the essential patterns without overreacting to noise.

Low Bias, High Variance: A very flexible model that might fit the training data really well but fails on new data.

High Bias, Low Variance: A very simple model that doesn't capture the data well, missing patterns.

Optimal Bias-Variance Trade-Off: A balanced model that captures the underlying trend without overfitting.

# R-squared {#Rsquare}

We already talked about [how to assess the slope of an individual predictor](#MeanSquaredError)

The [Residual Standard Error (RSE)](#ResidualStandardError) provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of Y , it is not always clear what constitutes a good RSE. The $R^2$ statistic provides an alternative measure of fit. It takes the form of a proportion ---the proportion of variance explained--- and so it always takes on a value between 0 and 1, and is independent of the scale of Y .

We already saw the [residual sum of squared (RSS)](/linearModels.html#Residualsumofsquares) RSS is a measure of the discrepancy between the actual data and the model's predictions. It represents the sum of the squared differences between the observed values and the predicted values.

$$
RSS=\sum^n_{i=1}(y_1-\hat{y_1})^2 
$$ RSS measures the amount of variability that is left unexplained after performing the regression.

The **total Sum of Squares (TSS)**: TSS is a measure of the total variability in the observed data. It represents the sum of the squared differences between the observed values and the mean of the observed values. TSS measures the total variance in the response Y and can be thourhg of as the amount of variability inherent in the response before the regression is performed.

$$
TSS=\sum^n_{i=1}(y_1-\bar{y_1})^2
$$ {#eq-tss}

TSS - RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in Y that can be explained using X.

$$
R^2 = \frac{TSS-RSS}{TSS}= 1-\frac{RSS}{TSS}
$$ {#eq-rsquared}

The R-squared value measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It's calculated as:

A larger R-squared value indicates a better fit of the model to the data. An R-squared value closer to 1 means that the model explains a large proportion of the variance in the dependent variable, signifying a good fit. Conversely, a smaller R-squared value indicates that the model explains less of the variance, signifying a poorer fit, or the error variance is high, or both.

However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic.

The $R^2$ statistic is a measure of the linear relationship between X and Y, and if we recall [correlation](@/linearModels.html#correlation-coefficient), it also measures the linear relationship between X and Y, this suggest that we might be able to use correlation $r-Cor(X,Y)$ instead of $R^2$ in order to assess the fit of the model, and in fact, it can be shown that in the simple linear regression setting, $R^2 = r^2$.

# Residual Standard Error (RSE) {#ResidualStandardError}

Another measure of the model fit is the **Residual Standard Error (RSE)** The RSE is an estimate of the standard deviation of the error term, representing the average amount that the observed values deviate from the model's predictions. Essentially, it gives you an idea of the typical size of the residuals (errors) made by the model. It's like a measure of how well the model fits the data.

$$
RSE = \sqrt{\frac{1}{n-p-1}RSS}
$$ {#eq-RSE}

where $n$ is the number of observations and $p$ is the number of predictors or independent variables in the model. $n-p-1$ is the *degrees of freedom*. A smaller RSE indicates a better fit, implying that the model's predictions are close to the actual data. - RSS helps calculate RSE, giving you the scale of the errors. - TSS is used to calculate $R^2$, which, together with RSE, helps you understand the overall model fit.

Both RSE and $R^2$ provide insight into the model's performance, but from slightly different angles. While $R^2$ tells you how much variance in the dependent variable is explained by the model, RSE tells you about the typical size of the prediction errors.

By using these metrics together, you can get a comprehensive view of how well your regression model is performing

# Multiple Linear regression.

When we have more than one predictor, the model will create a plane instead of a curve that tries to minimize the sum of squared errors:

```{r, fig.align='center', echo=FALSE}
# Set seed for reproducibility
set.seed(123)

# Generate sample data with a stronger linear relationship
n <- 100
X1 <- runif(n, 0, 10)
X2 <- runif(n, 0, 10)
Y <- 5 + 1.5*X1 + 2*X2 + rnorm(n, sd = 2)  # Adjusted coefficients and smaller noise

# Create data frame
data <- data.frame(X1, X2, Y)

# Fit the linear model
model <- lm(Y ~ X1 + X2, data = data)

# Predict values for plotting the hyperplane
x1_seq <- seq(min(data$X1), max(data$X1), length.out = 30)
x2_seq <- seq(min(data$X2), max(data$X2), length.out = 30)
grid <- expand.grid(X1 = x1_seq, X2 = x2_seq)
grid$Y_pred <- predict(model, newdata = grid)

p <- plot_ly(data, x = ~X1, y = ~X2, z = ~Y, type = 'scatter3d', mode = 'markers', marker = list(size = 3)) %>%
  add_trace(x = grid$X1, y = grid$X2, z = grid$Y_pred, type = 'mesh3d', opacity = 0.5, color = 'blue') %>%
  layout(scene = list(
    xaxis = list(title = 'X1'),
    yaxis = list(title = 'X2'),
    zaxis = list(title = 'Y')
  ))
p
```

Remember the formula for [RSS](linearModels.html#eq-RSSsimpleRegression) for a simple linear model was explained already now we have a similar version including all the predictors:

$$
RSS=\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta_0} - \hat{\beta_1} x_{i1} - \hat{\beta_2} x_{i2}\dots... - - \hat{\beta_p} x_{ip}\right)^2
$$ {#eq-RSSmultipleRegression}

This is done using statistical software. The values $\hat{\beta_o} ,\hat{\beta_1},\dots ,\hat{\beta_p}$ that minimize RSS are the **multiple least square regression coefficient estimates**

The ideal scenario is when the predictors are uncorrelated: each coefficient can be estimated and tested separately, a unit change in $X_j$ is associated with a $\beta_j$ change in $Y$ while all the other variables stay fixed.

Correlations amongst predictors cause problems: the variance of all coefficients tends to increase, sometimes dramatically and interpretations become hazardous, when $X_j$ changes, everything else changes. We cannot attribute the change to a specific predictor. For example height and weight are two predictors that are correlated.

When looking at multiple regression models we have to ask ourselves some questions: - Is at least one of the predictors useful in predicting the response - Do all the predictors help to explain Y or only some of them are useful? - How well does the model fit the data?

# Testing error and training error {#testingtrainingerror}

The *training error rate* is the proportion of mistakes that are made if we apply our model to the training data-set.

The *test error* is the average error that results from using a statistical learning method to predict the response on a new observation, that is, a measurement that was not used in training the method.

The test error can be easily calculated if a designated test set is available. Unfortunately, this is usually not the case.In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training. But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

# Model fit

Two of the most common numerical measures of model fit are the RSE and $R^2$, the fraction of variance explained. These quantities are computed and interpreted in the same fashion as for simple linear regression. Recall that [in simple regression, $R^2$ is the square of the correlation of the response and the variable](#Rsquare). In multiple linear regression, it turns out that it equals $Cor(Y, \hat{Y} )^2$, the square of the correlation between the response and the fitted linear model. It turns out that $R^2$ will always increase when more variables are introduced, even if those variables are only weekly associated with the response. If we notice that the increase in $R^2$ is minimal after adding another variable, it is probably best to leave that variable out of the model. We can check RSE in combination with $R^2$, as RSE will increase despite RSS increasing as well if the new variable is not adding enough value. Remember the formula for RSE (@eq-RSE):

$$
RSE = \sqrt{\frac{1}{n-p-1}RSS}
$$

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

::: exercise-box
Example:

We run a campaign to sell a product, and we advertise in three mediums, newspapers, radio and tv. We have the amount spent in each of the mediums and the sales. We will first run a simple linear model over the three variables:

```{r}
Advertising <- readr::read_csv("data/Advertising.csv")
tv<- lm(sales ~ TV, data = Advertising)
summary(tv)

radio<- lm(sales ~ radio, data = Advertising)
summary(radio)

newspaper<- lm(sales ~ newspaper, data = Advertising)
summary(newspaper)
```

We see that independently, all three variables have a $p$-value below or significance level.

If we fit the model to the three variables instead newspaper stops showing as significant.

```{r}
lm.fit<- lm(sales ~ TV+radio+newspaper, data = Advertising)
summary(lm.fit)

```

The [*Adjusted Rsquared*](#adjusted-rsquared) is 0.8956 and the RSE 1.686

If we fit the model with only the two significant variables:

```{r}
lm.fit<- lm(sales ~ TV+radio, data = Advertising)
summary(lm.fit)

```

The *adjusted R-squared* is 0.8962 instead of 0.8956 but the RSE has decreased from 1.686 in the model with three predictors to 1.681 in the simpler model. So, the model sales \~ TV + radio is more parsimonious and just as effective. Plus, it's always good to avoid unnecessary complexity when you can.
:::

Sometimes visualizing the data is helpful:

```{r, fig.align='center', fig.width=10}
# Calculate predicted values
Advertising$predicted_sales <- predict(lm.fit)

# Create 3D scatter plot
s3d <- scatterplot3d(Advertising$TV, Advertising$radio, Advertising$sales, 
                     pch = 16, highlight.3d = TRUE, type = "p", 
                     main = "3D Scatter Plot with Regression Plane",
                     xlab = "TV", ylab = "Radio", zlab = "Sales")

# Add regression plane
s3d$plane3d(lm.fit, col = "lightblue")

# Plot lines from data points to the regression plane
for (i in 1:nrow(Advertising)) {
  s3d$points3d(x = c(Advertising$TV[i], Advertising$TV[i]), 
               y = c(Advertising$radio[i], Advertising$radio[i]), 
               z = c(Advertising$sales[i], Advertising$predicted_sales[i]), 
               type = "l", col = "green")
}

```

# Collinearity {#collinearity}

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Since collinearity reduces the accuracy of estimates of the regression coefficients, it causes the standard error to grow. Recall that the t-statistic for each predictor is calculated by dividing $\hat{\beta_j}$ by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject the null hypothesis, this means that the probability of correctly detecting a non-zero coefficient is reduced. A simple way to detect collinearity between two variables is to look at the correlation matrix of the predictors, but it is also possible that there is collinearity between three variables, even if no pair of variables have a high correlation, and this will not be detected in the correlation matrix. For this we can calculate the *variance inflaction factor (VIF)* but this is out of the scope here.

# F-statistic.

$$
F= \frac{TSS-RSS/p}{RSS/(n-p-1)}
$$ {#eq-fStatistic}

The F-statistic in a multiple regression model assesses *the overall significance of the model*. It tells you whether your model's variables are jointly significant in explaining the variation in the dependent variable. High F-statistic: Indicates that the group of predictors has a significant relationship with the dependent variable. The larger the F, the more likely the observed relationship is not due to random chance. Low F-statistic: Suggests that the group of predictors does not have a significant explanatory power.

Whether an F value is high or low is determined by comparing it to a critical value from the F-distribution table, which depends on the degrees of freedom and the chosen significance level (commonly 0.05). Steps to Determine High or Low F-value: - Degrees of Freedom: Calculate the degrees of freedom for the numerator (df1 = number of predictors) and the denominator (df2 = total number of observations - number of predictors - 1).

F-distribution Table: Use the degrees of freedom to find the critical F-value from the F-distribution table at your chosen significance level (usually 0.05).

Compare: If the calculated F-value from your regression output is greater than the critical F-value from the table, it means your predictors are collectively significant (high F-value).

Example: Suppose you have 3 predictors in your model and a sample size of 100. The degrees of freedom are:

df1 (numerator) = 3

df2 (denominator) = 100 - 3 - 1 = 96

Looking up the critical value for df1 = 3 and df2 = 96 at the 0.05 significance level in an F-table, you might find it to be around 2.70.

If your calculated F-value is, say, 5.2, since 5.2 \> 2.70, your model is significant (high F-value).

If it were 1.8, since 1.8 \< 2.70, your model isn't significant (low F-value).

In r, the f-value will also come with its $p$-value, indicating its significance.

# Deciding on the important variables.

With today's big data there is more availability of data than ever, and it is becoming usual to have many many features to fit a model, and model accuracy can be challenging in those situations, specially when we have more features than number of samples (p\>n). By removing irrelevant features we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing *feature selection*.

-   *Subset selection*: We identify a subset of the $p$ predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

-   *Shrinkage*: We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as *regularization*) has the effect of reducing variance and can also perform variable selection.

-   *Dimension reduction*: We project the $p$ predictors into a M-dimensional subspace, where M\<p. This is achieved by computing M different *linear combinations* or *projections*, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

## Best Subset selection

The most direct approach to answer this question is called *all subsets* or *best subsets* regression. We compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size. This gets really hard when the number of variables is high, the number of possible models is $2^p$ so for example for p=40 there are over a billion models. Instead, we need an automated approach that searches through a subset of them. We discuss two commonly use approaches next.

Although we are presenting this here in the context of linear models, the same ideas apply to other types of models, such as logistic regression. One difference will be that other models will work with a similar metric to the residual sum of squares that is called *deviance*.

A limitation of the subset selection method is the number of p. Most statistical packages have a limit of 30 to 40 features. Another problem of this method with high number of $p$ is overfitting, so this method is only valid when $p$ is small (\<10).

We start with the *null model* that is a model that contains the intercept but no predictors. The intercept is the mean of $y$ Now you start adding variables one at a time: for each variable you fit $p$ simple linear models each with one of the variables and the intercept, and you look at each of them. After that you choose the variable that results in the lowest RSS (that is the largest $R^2$). Now, having picked that, you fix that variable in the model and now you add one by one all the variables to this model again, and choose the one that best improve the residual sum of squares.

You can continue like this until some stopping rule is satisfied, for example, when all the remaining variables have a $p$ value above some threshold.

Once we have a correct combination of predictors, to select if the best model is the model with 2,3,or n predictors, we cannot use RSS anymore because naturally RSS will decrease as we include more predictors, but that does not necessarily means it's a best model. We need to select the model that will have the smallest *test error*, and not the one with the smallest *training error*. Remember [we talked about these concepts already](#testingtrainingerror). Frequently for this we use [cross-validation](#crossvalidation)

::: exercise-box
Example

Here we apply the best subset selection approach to the `Hitters` data from ISLR2 library. We wish to predict a baseball player's Salary on the basis of various statistics associated with performance in the previous year. Note that the Salary variable is missing in some rows, so we are going to omit them

```{r}
#nubmer of nas. 
sum(is.na(Hitters$Salary))
#remove nas
Hitters<- na.omit(Hitters)
sum(is.na(Hitters$Salary))
```

The `regsubsets()` function (part of the `leaps` library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for 'lm()'. The 'summary()' command outputs the best set of variables for each model size.

```{r}
regfit.full <- leaps::regsubsets(Salary ~ ., Hitters)
(reg.summary<- summary(regfit.full))
```

For each subset size (number of features 1...n) it returns the best subset models. An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI.

By default, `regsubsets()` only reports results up to the best eight-variable model. But the `nvmax` option can be used in order to return as many variables as are desired.

```{r}
regfit.full <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
(reg.summary <- summary(regfit.full))
```

We can see the R squared using `reg.summary$rsq`. Here we can see that the R2 statistic increases from 32%, when only one variable is included in the model, to almost 55%, when all variables are included. As expected, the R2 statistic increases monotonically as more variables are included, so this is not a good statistic for choosing the right size for the model because it will always recommend the model with more variables.

```{r}
reg.summary$rsq
```

So now that we have the best combination of variables for each model size, we can use cross validation or other methods to select the model with the best number of predictors.

We will see study Mallow's Cp, AIC or BIC soon, and `regsubset` gives us the statistics for the different models for those methods:

```{r, fig.align='center', fig.width=10, fig.height=4}
par(mfrow = c(2, 2))
plot(reg.summary$bic, xlab = "Number of Variables",
    ylab = "BIC", type = "l")
plot(reg.summary$cp, xlab = "Number of Variables",
    ylab = "Cp", type = "l")
plot(reg.summary$rsq, xlab = "Number of Variables",
    ylab = "RSS", type = "l")
plot(reg.summary$rss, xlab = "Number of Variables",
    ylab = "rss", type = "l")
```

with this information we will be able to select the model with the correct number of variables.
:::

## Stepwise Selection

### Forward Stepwise selection

The stepwise selection is very similar to the best subset as we start with a model with only the intercept and start aggregating predictors and selecting those with less RSS. The difference with best subset selection is that these models are nested and brings the number of models that we are considering from $2^p$ to $p^2$ which is more manageable. Because we are not considering every possible combination of all the predictors, this method is not guaranteed to find the model with less RSS.

Again, once we have the best version of the model with each number of predictors, we will choose among them using cross-validation or a similar technique.

### Backward stepwise selection

It runs a similar process but starting with a model with all the variables, then you run the model removing each variable and see which one is the least statistically significant one. The new (p-1) variable model is fit, and the variable with the largest $p$-value is removed. Continue until a stopping rule is reached, for instance, we may stop when all remaining variables have a significant $p$-value defined by some significance threshold.

One limitation of the backward stepwise selection is that we can only do it with n\>p (more samples than predictors). This limitation does not apply to forward stepwise selection.

::: exercise-box
Example

We can also use the `regsubsets()` function to perform forward stepwise or backward stepwise selection, using the argument `method = "forward"` or `method="backward"`. We are going to format a bit the output of the summary so it is more friendly to the reader:

```{r}
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19, method = "forward")

regfit.bwd <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19, method = "backward")

createSummaryTable <- function(regfitObject){

summary <- summary(regfitObject)
matrix_of_models <- as.data.frame(summary$which)
matrix_of_models <- matrix_of_models %>%
  mutate_all(~ifelse(., "**", ""))

kable(matrix_of_models, escape = FALSE) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE) 
}

createSummaryTable(regfit.fwd)

createSummaryTable(regfit.bwd)
```

For instance, we see that using forwards stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best model for our number of variabless identified by forward and backward stepwise selection and best subset selection are different.

```{r}
coef(regfit.full, 7)

coef(regfit.fwd, 7)

coef(regfit.bwd, 7)
```

As we saw before, the summary() function also returns $R^2$, RSS, adjusted $R^2$, Cp, and BIC. We can examine these to try to select the best overall model.

```{r}
reg.summary <- summary(regfit.fwd)
names(reg.summary)
```

Plotting RSS, adjusted R Squared, Cp and BIC for all the models at once will help us decide which model to select. Note the `type ="1"` option tells R to connect the plotted points with lines. Here we are plotting RSS and Adjusted R Squared

```{r, fig.align='center', fig.width=10, fig.height=4}
par(mfrow = c(1, 2))
plot(reg.summary$rss, xlab = "Number of Variables",
    ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")
```

The `points()` command works like the `plot()` command, except that it puts points on a plot that has already been created, instead of creating a new plot. The `which.max()` function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted R2 statistic, which would be the preferred one.

```{r, fig.align='center'}
max<- which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")
points(max, reg.summary$adjr2[max], col = "red", cex = 2, 
    pch = 20)
```

Similarly we can plot the Cp and the BIC statistics and indicate the models with the smallest statistic:

```{r, fig.align='center'}
min<- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab = "Number of Variables",
    ylab = "Cp", type = "l")
points(min,reg.summary$cp[min], col = "red", cex = 2, 
    pch = 20)
```

```{r, fig.align='center'}
min<- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = "Number of Variables",
    ylab = "BIC", type = "l")
points(min, reg.summary$bic[min], col = "red", cex = 2,
    pch = 20)

```

The 'regsubsets()' function has a built-in 'plot()' command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, Cp, adjusted R2, or AIC. To find out more about this function, type ?plot.regsubsets. We will see the BIC here:

```{r, fig.align='center', fig.width=10, fig.height=8}
plot(regfit.fwd, scale = "bic")
```

The top row of the plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to −150. However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the `coef()` function to see the coefficient estimates associated with this model.

```{r}
coef(regfit.fwd, 6)
```

As we did before we can plot the results for Cp and and see the difference between our bakwards and forward step wise selections:

```{r}
plot(regfit.fwd, scale="Cp")
plot(regfit.bwd, scale="Cp")
```
:::

# Estimating the test error

-   We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.

-   We can directly estimate the test error, using validation set approach or a cross-validation approach (see chapter [Resampling Methods](#resampling-methods)). These involve fitting the model in part of the dataset and testing in with unseen data.

We are going to see here other methods:

## Mallow's Cp

This technique adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables. $$
C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)
$$ {#eq-mallowscp}

Where d is the total number of parameters used and - $\hat{\sigma}^2$ is an estimate of the variance of the error term $\epsilon$ associated with each response measurement, calculated as: $$
\hat{\sigma}^2 = \frac{SSE}{n-p}
$$ {#eq-errorvariance} SSE is the *Sum of Squared Errors* or *Residual Sum of Squares*

$$
SSE = \sum^n_{i=1}(y_i-\hat{y_i})^2
$$ {#eq-SSESumSquaredErrors1}

This technique is restricted to those models where p\<n. Example using the mtcars dataset

```{r}
# Fit the model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Extract the residuals
residuals <- model$residuals

# Calculate SSE
SSE <- sum(residuals^2)

# Number of observations
n <- length(mtcars$mpg)

# Number of predictors (including intercept)
p <- length(coefficients(model))

# Calculate \(\hat{\sigma}^2\)
sigma_hat_sq <- SSE / (n - p)

# Print the result
print(sigma_hat_sq)

```

Now we calculate Mallow's Cp for the same dataset for a model with with two parameters:

```{r}
# Fit the subset model
subset_model <- lm(mpg ~ hp + wt, data = mtcars)

# Calculate residuals and SSE for the subset model
residuals_subset <- subset_model$residuals
SSE_subset <- sum(residuals_subset^2)

# Number of predictors in the subset model (including intercept)
p_subset <- length(coefficients(subset_model))
print(SSE_subset)
# Calculate Mallows' Cp 
Cp <- (SSE_subset / sigma_hat_sq) + 2 * p_subset - n 
print(Cp)
```

When comparing two values for different models we want to choose the smallest Cp.

## AIC

The criterion is defined for a large class of models fit by maximum likelihood. We are not going to look at the formula right now, but it is similar to the previous and it is actually proportional to Cp, so again, we will choose the model with the smallest value.

## Bayesian Information Criterion (BIC)

$$
BIC = \frac{1}{n}(RSS+log(n)d\hat{\sigma}^2)
$$ {#eq-BIC}

We also look for the smallest value of the BIC for our models. The BIC statistic generally places a heavier penalty on models with many variables than the two methods seen before.

Because all these three methods rely on calculating hat sigma squared, all of them require that n\>p due to the formula to calculate it (@eq-errorvariance)

## Adjusted RSquared {#adjusted-rsquared}

For a least squared model with $d$ variables, the adjusted $R^2$ statistic is calculated as:

$$
\text{Adjusted } R^2= 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}
$$ {#eq-adjustedRsquared}

where TSS is the total sum of squares.

Unlike Cp, AIC and BIC, for which a small value indicates a model with low test error, a large value of adjusted R squared indicates a model with small test error. Maximizing the adjusted $R^2$ is equivalent to minimizing RSS/(n-d-1). While RSS always decreases as the number of variables in the model increases, the presence of d in the denominator corrects for that, so the inclusion of unnecessary variables in the model pays a price.

The drawback of Adjusted Rsquared is that it cannot be used for other models, like logistic, for example.

::: exercise-box
Example: Validation Set Approach:

We saw how to choose our best models for the `Hitters` dataset using Best subset selection, Stepwise Selection and then using methods like BIC, AIC to select the best size for our model.

Now we will consider how to do this using the validation set and cross-validation approaches. In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of the model-fitting, including variable selection. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error. In order to use the validation approach, we begin by splitting the observations into a training set and a test set. We choose the size of the train set as 180 (approx 2/3 of the data)

```{r}
objects_to_remove <- ls(pattern = "^regfit") 
rm(list = objects_to_remove)
n<-nrow(Hitters)
set.seed(1)
train <- sample(seq(1:n), 180, replace= FALSE)

```

now we apply `regsubsets()` to the training set in order to perform best subset selection.

```{r}
regfit.fwd<- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax= 19, method="forward")
createSummaryTable(regfit.fwd)

```

We now make a model matrix from the test data:

```{r}
test.mat <- model.matrix(Salary ~ ., data = Hitters[-train, ])
head(test.mat)
```

The `model.matrix()` function is used in many regression packages for building an X matrix from data. Now we run a loop, and for each size i, we extract the coefficients from `regfit.fwd` for the best model that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test Mean Square for Error (MSE).

First we will do a sample iteration showing some of the the values so we can see what the loop is doing we choose the model with three variables + the intercept:

```{r}
(coefi <- coef(regfit.fwd, id = 3))
 print(names(coefi))
 head(test.mat[, names(coefi)])
 pred <- test.mat[, names(coefi)] %*% coefi
 head(pred)
 salaryTestData<- Hitters$Salary[-train]
 meanSquareError<- mean((salaryTestData - pred)^2)
```

Let's run the loop now:

```{r}
val.errors <- rep(NA, 19)
for (i in 1:19){
  coefi <- coef(regfit.fwd, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[-train] - pred)^2)
}

```

we can now plot the square root of MSE to see the results. Taking the square root of MSE gives us the *Root Mean Squared Error (RMSE)*, which is on the same scale as the original data. This makes it easier to interpret and compare. The model summary does not give us MSE directly, but we can calculate it: $MSE=RSS/n$. In our code: `regfit.fwd$rss[-1]/length(Hitters$Salary[test])` the minus one is to remove the intercept.

```{r, fig.align='center', fig.width=7}
plot(sqrt(val.errors), ylab= "Sqrt MSE", pch=1,type="b",ylim= c(250,450))
points(which.min(val.errors), sqrt(val.errors[which.min(val.errors)]), col="red", cex = 2,
    pch = 20, )

points(sqrt(regfit.fwd$rss[-1]/length(Hitters$Salary[train])),col="blue", pch =1, type="b")
legend("topright", legend=c("Training","Validation"), col= c("blue","black"), pch=1)

```

the best model is the one with the min error:

```{r}
val.errors
min<-which.min(val.errors)
coef(regfit.fwd, min)
```

This was a little tedious because there is no predict method for `regsubsets()`. Since we will be using this function again, we can write our own method:

```{r}
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) #extract call formula
  mat <- model.matrix(form, newdata) 
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
 }

```

Our function pretty much mimics what we did above. The only complex part is how we extracted the formula used in the call to `regsubsets()`. We demonstrate how we use this function below, when we do cross-validation.

Finally, we perform best subset selection on the full data set, and select the best model for our number of variables. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best model for our number of variables, rather than simply using the variables that were obtained from the training set, because the best model for our number of variables on the full data set may differ from the corresponding model on the training set.

In fact, we see that the best model for our number of variables on the full data set has a different set of variables than the best model for our number of variables on the training set.

```{r}
regfit.best <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
coef(regfit.best, min)
```
:::

# Resampling Methods {#resampling-methods}

## Monte Carlo Method

The Monte Carlo method is a powerful statistical technique used to understand the impact of risk and uncertainty in prediction and forecasting models.

If for example we want to know the average height of all people living in USA, we estimate the parameter we are interested in (average height of the population) using a statistic estimator that is the average height of a sample of the population drawn at random. $\theta$ is the parameter we want to calculate and we use $\hat{\theta}$ that is the average of our sample. By the law of large numbers, the approximation error can be made arbitrarily small by using a large enough sample size.

What if we are interested in an estimator $\hat{\theta}$ for some parameter $\theta$ of the population and the normal approximation is not valid for that estimator? In such situations, simulations can often be used to estimate the parameters.

The Monte Carlo method relies on repeated random sampling to obtain numerical results. The core idea is to use randomness to solve problems that might be deterministic in nature. This method is named after the Monte Carlo Casino in Monaco, reflecting the element of chance and randomness.

In our example to estimate the height of the population of USA adults we get many (let's say 1000) samples of n=100 observations. We then compute the estimator $\hat{\theta}$ for each sample (in our case the average height), resulting in 1000 estimates. Now we compute the standard deviation of these 1000 estimates and it results to be close to the standard error of my estimator. :

$$
SE(\hat{\theta})=\sqrt{E(\hat{\theta}-E(\hat{\theta}))^ 2}
$$

$$
s(\hat{\theta}_1,...,\hat{\theta}_{1000})= \sqrt{\frac{1}{999}\sum^{1000}_{i=1}(\hat{\theta}-mean(\hat{\theta_i}))^ 2} \approx SE(\hat{\theta})
$$

The caveat of this method is that it will only work where we can extract as many samples as we wish.

Usually we cannot have as many samples as we wish, when we want to use the Monte Carlo Method in practice, it is typical to assume a parametric distribution and generate a population from this, which is called a **parametric simulation**. This means that we take parameters estimated from the real data and using the mean and standard deviation, we plug these into a model of normal distribution and generate new samples.

For example for the case of mice weights, we know that mice typically weight 24 gr with a SD of about 3.5 gr. and that the distribution is approximately normal, to generate population data:

```{r}
controls<- rnorm(5000, mean=24, sd=3.5) 
```

## Cross-validation {#crossvalidation}

As the models become more complex (the more features we add to our model), the [training error](#testingtrainingerror) becomes lower. On the other hand, the *test error* does not reduces at the same rate as the model gets more complex, it starts going down but then it starts increasing again as the model starts overfitting to the training dataset.

The prediction error comes from bias and variance. The *bias* is how var off on the average the model is from the truth. The *variance* is how much the estimate varies around its average. The less parameters we fit into our model, the more bias we have (we will find it more difficult to find the true mean) and as we increase the parameters, the bias reduces, but the variance goes up, because we have more and more parameters to estimate from the same amount of data. The solution for this would be a large designated test set, but this is not often available.

There are some methods to try to adjust the training error so it does not over fit and produces an increase in the testing error rate. We randomly divide the available set of samples into two parts: a *training set* and a *validation* or *hold out set*. The model is fit on the training set and the fitted model is used to predict responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error. This is typically assessed using [Mean Squared Error (MSE)](#MeanSquaredError) in the case of quantitative response and misclassification rate in the case of qualitative (discrete) responses.

```{r, fig.aligh ='center', fig.width=12, fig.height=6, fig.cap='two-fold cross validation'  }
# Set up the plotting area for two plots side by side
par(mfrow = c(1, 2))

# Plot 1: MSE with All Data
# Initialize a vector to store MSE errors for each polynomial degree
mse_error_all <- rep(0, 10)

# Loop to calculate MSE errors for polynomial degrees 1 to 10
for (i in 1:10) {
  glm_fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  predictions <- predict(glm_fit)
  actuals <- Auto$mpg
  mse_error_all[i] <- mean((predictions - actuals)^2)
}

# Plot the MSE errors for all data
plot(1:10, mse_error_all, 
     type = "b", 
     col = 1, 
     ylim= c(0,(max(mse_error_all) + 1)),
     xlab = "Degree of Polynomial", 
     ylab = "MSE", 
     main = "MSE with All Data")

# Plot 2: MSE with Different Samples
# Initialize a list to store MSE errors for each sample
mse_error_samples <- list()

# Loop to create different samples and calculate MSE errors
for (j in 1:10) {
  # Create a different sample
  train <- sample(c(TRUE, FALSE), nrow(Auto), replace = TRUE, prob = c(0.6, 0.4))
  
  # Initialize a vector to store MSE errors for the current sample
  mse_error_samples[[j]] <- rep(0, 10)
  
  # Loop to calculate MSE errors for polynomial degrees 1 to 10
  for (i in 1:10) {
    glm_fit <- glm(mpg ~ poly(horsepower, i), data = Auto[train, ])
    predictions <- predict(glm_fit, newdata = Auto[!train, ])
    actuals <- Auto$mpg[!train]
    mse_error_samples[[j]][i] <- mean((predictions - actuals)^2)
  }
}

y_max_samples <- max(unlist(mse_error_samples)) + 1

# Plot the MSE errors for different samples
plot(1:10, mse_error_samples[[1]], type = "b", col = 1, xlab = "Degree of Polynomial", ylab = "MSE", main = "MSE with Different Samples",
  ylim = c(0, y_max_samples))
for (j in 2:10) {
  lines(1:10, mse_error_samples[[j]], type = "b", col = j)
}
legend("bottomleft", legend = paste("Sample", 1:10), col = 1:10, lty = 1, cex = 0.8)

```

In the first graph we can see that a polynomial of level 2 reduces drastically the MSE from a linear model, but more polynomial terms do not have the same effect. The second graph shows us how the MSE vary with different samples, so the choice of the sample will have an effect on the fit of the model. Another drawback of this method is that, by fitting the model to a smaller number of data points, the estimated error will be higher than fitting the model to the whole available data points.

Now we will see some techniques that try to solve some of those drawbacks.

### K-fold Cross-validation

*K-fold cross validation* is a widely used approach for estimating test error. The idea is to randomly divide the data into $K$ equal-sized parts. We leave out part $k$, fit the model to the other $K-1$ parts (combined) and then obtain predictions for the left-out $kth$ part.This is done in turn for each part k, and then the results are combined. $K$ is usually between 5 and 10. Let's the $K$ parts be $C_1,C_2,\dots,C_k$. where $C_k$ denotes the indices of the observations in part k. There are $n_k$ observations in part k. $$
CV_{(K)}= \sum^K_{k=1}\frac{n_k}{n} MSE_k
$$ {#eq-kfoldCrossValidationErrorRate}

Where: $CV_{(K)}$: This represents the cross-validation error estimate for K-fold cross-validation. $\sum^K_{k=1}$: This indicates that the sum is taken over all $K$ folds. $\frac{n_k}{n}$: This is the proportion of the total data used in the k-th fold, where $n_k$ is the number of observations in the k-th fold and $n$ is the total number of observations. $MSE_k$: This is the Mean Squared Error for the k-th fold. $MSE_k=\sum_{i\epsilon C_k}(y_i-\hat{y_i})^2/n_k$ and $\hat{y_i}$ is the fit for the observation i, obtained from the data with the part $K$ removed.

One special use of the k-fold cross validation is setting K=n, this is called *leave-one-out cross validation (LOOCV)*: each observation is left out and used as a validation set, while the whole set of other data points are used for training the model. Leave one out cross validation for linear or polynomial model uses this formula: $$
CV_{(n)}=\frac{1}{n} \sum_{i=1}^n \left(\frac{y_i-\hat{y_i}}{1-h_i} \right)^2
$$ {#eq-leveoneoutCrossValidationErrorRate}

where $\hat{y_i}$ is the ith fitted value for the original least squares fit and $h_i$ is the leverage (diagonal of the "hat" matrix). This is like the ordinary MSE except the ith residual is divided by $1-h_i$ The $h_i$ tells you how much influence a single point has on the its own fit, it will be a number between 0 and 1.

Cross-validation takes the average of the errors over the n-folds so in LOOCV each sample looks much more like the other ones, because it only leaves one point out. As a result, each error estimate are highly correlated, when you average these highly correlated errors, the resulting average has a high variance -it is more sensitive to slight changes or outliers- because there's little variation in the training data across folds. It's not about the individual errors being different; it's about them moving together because the training sets are too similar. Remember the trade off we talked about between bias and variance in the [Bias-variance trade off](#bias-variance-trade-off) section

For this reason it is often preferred to use k=5 or 10. Let's compute the cross-validation in r using the formula we saw above. First we are going to fit the model to a linear model. We can use the function glm and if we don't pass any family type it will fit to a linear model.

```{r, fig.cap='LOOCV linear model'  }

#linear model:
fit.lm<- glm(mpg ~ horsepower, data = Auto)
loocv<- function(fit){
  h <-lm.influence(fit)$h
  mean((residuals(fit)/(1-h))^2)
}

loocv(fit.lm)
```

But we have a handy function in r that will do this for us: We will use the `library(boot)` to calculate the cross validation errors for the Auto dataset using the function `cv.glm`. If we don't pass any value of k, it will default to leave-one-out cross-validation.

```{r, fig.cap='LOOCV linear model'  }

glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- boot::cv.glm(Auto, glm.fit)
cv.err$delta
```

And we see that we get the same result that the one we calculated manually above. The first Delta is the cross-validation prediction error, and the second one is a biased corrected version of it, taking into consideration that the dataset is smaller for the dataset in cross validation. This has little effect in leave one out, but it will have a bigger effect in k=n cross-validation.

Now we are going to fit polynomials of different degrees to our data because it looks non-linear, so we expect some of these to fit better than a linear model.

```{r, fig.aligh ='center',fig.width=12, fig.height=4, fig.cap='LOOCV for different polynomials'}
# Leave one out
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- boot::cv.glm(Auto, glm.fit)$delta[1]
}
# Plot the MSE errors for all data
plot(1:10, cv.error, 
     type = "b", 
     col = 1, 
     ylim= c(2,(max(cv.error) + 1)),
     xlab = "Degree of Polynomial", 
     ylab = "MSE", 
     main = "Leave One Out")
```

Now let's try 10-k fold cross-validation.

```{r, fig.aligh ='center', fig.width=12, fig.height=4, fig.cap=' 10k-fold cross validation' }
## $k$-Fold Cross-Validation
# Initialize a list to store CV errors for each sample
cv.error.10 <- list()

# Loop to create different samples and calculate MSE errors
for (j in 1:10) {
  # Initialize a vector to store MSE errors for the current sample
  cv.error.10[[j]] <- rep(0, 10)
  
  # Loop to calculate MSE errors for polynomial degrees 1 to 10
  for (i in 1:10) {
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.error.10[[j]][i] <- boot::cv.glm(Auto, glm.fit, K = 10)$delta[1]
  }
}

y_max_samples <- max(unlist(cv.error.10)) + 1

# Plot the MSE errors for different samples
plot(1:10, cv.error.10[[1]], type = "b", col = 1, xlab = "Degree of Polynomial", ylab = "MSE", main = "k=10 fold CV",
  ylim = c(2, y_max_samples))
for (j in 2:10) {
  lines(1:10, cv.error.10[[j]], type = "b", col = j)
}
legend("bottomleft", legend = paste("k", 1:10), col = 1:10, lty = 1, cex = 0.8)

```

We can see how the variability in 10-k fold is much less than what we obtained before using half of the data (2-k)

For classification problems the k-fold cross-validation works exactly the same, the only difference is that instead of MSE to measure the cross validation error we use the \[misclassification error rate\] (#misclassificationErrorRate).

### Potential problems with cross-validations.

Consider a simple classifier applied to some two-class data. We have 5000 predictors and 50 samples. With this, we find the 100 predictors having the largest correlation with the class labels. We then apply a classifier such as logistic regression using only those 100 predictors. If we now try to estimate the test set performance of this classifier using cross-validation, we will get a much lower error estimate than real. This is because when we choose the 100 best predictors, this is a form of training for the model, and must be included in the validation process. If for example we simulate data with the class labels independent of the outcome (so our test error is 50%) and then we choose the best predictors and create a model with them, when we calculate the cross-validation, our error estimate will be smaller than it should be.

To avoid this we will have to create our folds and remove our test fold before the process of fitting and choosing the predictors, and then apply the cross validation over the fold we set apart.

You can find more information in the [lectureVideo](#https://www.youtube.com/watch?v=jgoa28FR__Y)

::: exercise-box
Example: Model selection with Cross Validation:

We now try to choose among the models of different sizes using cross-validation. We must perform best subset selection *within each of the* $k$ training sets. Despite this, we see that with its clever subsetting syntax, `R` makes this job quite easy.

First, we create a vector that allocates each observation to one of $k=10$ folds, and we create a matrix in which we will store the results.

```{r }
k <- 10
n <- nrow(Hitters)

folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19,
    dimnames = list(NULL, paste(1:19)))
```

Now we write a for loop that performs cross-validation. In the $j$th fold, the elements of `folds` that equal `j` are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new `predict()` method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix `cv.errors`. Note that in the following code `R` will automatically use our `predict.regsubsets()` function when we call `predict()` because the `best.fit` object has class `regsubsets`.

```{r }
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ .,
       data = Hitters[folds != j, ],
       nvmax = 19)
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] <-
         mean((Hitters$Salary[folds == j] - pred)^2)
   }
}
```

This has given us a $10 \times 19$ matrix, of which the $(j,i)$th element corresponds to the test MSE for the $j$th cross-validation fold for the best $i$-variable model. We use the `apply()` function to average over the columns of this matrix in order to obtain a vector for which the $i$th element is the cross-validation error for the $i$-variable model.

```{r, fig.align='center'}
(mean.cv.errors <- apply(cv.errors, 2, mean))

par(mfrow = c(1, 1))

plot(sqrt(mean.cv.errors), type = "b", ylab="sqrt MSE")

```

We see that cross-validation selects a 10-variable model. We now perform best subset selection on the full data set in order to obtain the 10-variable model.

```{r}
reg.best <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
coef(reg.best, 10)
```
:::

## The bootstrap principle {#bootstrap}

The *bootstrap* is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical method. For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

We have an estimate $\hat{\theta}$ for a parameter $\theta$ and want to know how accurate our estimate is. For that we need to know the $SE$ of the estimate or give a confidence interval for the parameter $\theta$. The bootstrap can do this in quite general settings:

Example $\theta$= average height of all people in the US. It is unknown but estimated with the average height of 100 randomly selected people. We can't compute the population mean because we can't access the whole population. So we ' plug in' the sample in place of the population and compute the mean of the sample instead. The rationale for the plug-in principle is that the sample mean will be close to the population mean because the sample histogram is close to the population histogram.

The bootstrap uses the plug-in principle and the Monte Carlo Method to approximate quantities such as $SE(\hat{ \theta})$ when we cannot get many samples.

The bootstrap pretends that the sample histogram is the population histogram and uses the Monte Carlo to simulate the quantity of interest.

What we do is we draw n samples **with replacement** from the original sample, each sample is the same size $n$ as the original dataset, and with those samples we use the Monte Carlo principle to calculate the quantity of interest. Example of 4 samples chosen at random from a dataset with three points:

```{r}
x <- rnorm(3,mean=5,sd=2)
Y <- 0.2+0.33*x+ rnorm(1, -1.1, .8)

pop<- data.frame(obs=c("A","B","C"),x,Y)
pop
#bootstrap samples:
set.seed(1)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
set.seed(2)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
set.seed(3)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
set.seed(4)
s<- sample(c(1,2,3),3,replace=TRUE)
pop[s,]
```

Drawing a bootstrap sample by sampling with replacement from the data is called *nonparametric bootstrap*.

If we know that the data follow a normal distribution, but we don't know the mean on the standard deviation. The obvious thing to do there is to simply estimate the unknown mean and standard deviation, and then simply sample from the normal distribution, with that mean and standard deviation. That's called *parametric bootstrap*.

This type of sampling works if the data are independent, that is, $X_1$ to $X_n$ are really generated independently. But oftentimes, there's dependence in the data. For example, the data are observed over time.

If the sampling distribution of $\hat{\theta}$ is approximately normal, then the confidence interval formula is :

$$
\left[ \hat{\theta} - z_{\alpha/2} SE(\hat{\theta}), \hat{\theta} + z_{\alpha/2} SE(\hat{\theta}) \right]
$$ {#eq-conficenceIntervalEstimatorNormalDistribution}

This formula represents a confidence interval for an estimator $\hat{\theta}$, where:

-   $\hat{\theta}$ is the point estimate.

-   $z_{\alpha/2}$ is the critical value from the standard normal distribution for a given confidence level. For example for a confidence level of 95% $\alpha =(1-0.95)=0.05$ ; $\frac{\alpha}{2}=0.025$, $z_{\frac{\alpha}{2}}=1.96$

-   $SE(\hat{\theta})$ is the standard error of the estimator.

If the distribution is not normal, we can still use the bootstrap by estimating that the distribution of the estimate can be approximated by the distribution of the bootstrap-copies extracted from it, so we can draw a distribution curve and calculate the 95% confidence interval using that curve:

$$
\left[ \hat{\theta}^*_{\alpha/2}, \hat{\theta}^*_{1-\alpha/2} \right]
$$ {#eq-conficenceIntervalEstimatorNotGaussian}

where $\hat{\theta}^*_{\alpha/2}$ is the $\alpha/2$ percentile of the $\hat{\theta}^*_{1},....\hat{\theta}^*_{B}$

::: exercise-box
Estimating the accuracy of a linear regression Model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for $\beta_0$ and $\beta_1$ (intercept and slope) for a linear regression model that uses horsepower to predict mpg (miles per gallon) in the Auto data set.

We first create a simple function `boot.fn()` which takes in the Auto data set as well as the indices for the observations, and returns the intercept and slope estimates for the linear regression model.

```{r}
boot.fn<- function (data, index){
  coef(lm(mpg ~ horsepower, data= data, subset = index))
}
```

We then apply this function to the full set of 392 observations in order to compute get the slope and intersect for the full data set:

```{r}
boot.fn(Auto,1:392)
```

if we use bootstrap principle we just need to create a sample with replacement of the same size of the full data set. This is one sample example:

```{r}
boot.fn(Auto, sample(392,392, replace=TRUE))
```

but what we do in r is to use the `boot()` function from the boot library to compute the standard errors of 1,000 bootstrap estimates for the intercept and the slope:

```{r}
set.seed(1)
boot(Auto, boot.fn,1000)
```

This indicates that the bootstrap estimate for $SE(\beta_0)$ is 0.84 and for $SE(\beta_1)$ is 0.0073

Standard formulas can be used to compute the standard errors for the regression coefficients in a linear model. These can be obtained using the summary() function.

```{r}
summary(lm(mpg ~ horsepower, data= Auto))$coef
```

these standard errors are calculated using the formulas we saw in [Assessing the Accuracy of the Coefficient Estimates and Confidence Intervals](#assessing-the-accuracy-of-the-coefficient-estimates-and-confidence-intervals)

The [formula for the slope](/linearModels.html#eq-standardErrorSlope) and for the [intercept](/linearModels.html#eq-standardErrorIntercept)

These formulas rely on certain assumptions (out of the scope for this document, check the ISLR book for more info). The bootstrap approach does not rely on any of these assumptions, and so, it is likely giving more accurate estimate of the standard errors than the summary function
:::

If the data is a time series, we can't simply sample the observations with replacement because the data is not independent. There's correlation in time series. For example we want to predict the value of a stock price using the previous days stock price. We expect the stock price for a given day to be correlated with the stock price from the previous day. This creates a problem for the bootstrap because bootstrap assumes the data are independent.

What people uses for time series is the *block boostrap*, it divides the data up into blocks and we assume that in between blocks, the data is independent.

Another instance where bootstrap is not a good approach is for estimating the prediction error in cross-validation problems. In cross-validation, each of the $K$ validation folds is distinct from the other K-1 folds used for training, there is no overlap, this is crucial for its success. But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample. This will cause the bootstrap to seriously underestimate the true prediction error.

::: exercise-box
Estimating the Accuracy of a Statistic of Interest

We will be using the dataset Portfolio from LSLR package for this practice. Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction α of our money in X, and will invest the remaining 1 − α in Y . Since there is variability associated with the returns on these two assets, we wish to choose α to minimize the total risk, or variance, of our investment.In other words, we want to minimize Var(αX +(1−α)Y ). The value that minimizes the risk is given by the formula:

$$
\alpha= \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}
$$

where $\sigma^2_X$= VAR(X), $\sigma^2_Y$= VAR(Y) and $\sigma_{XY}$ = COV(X,Y). These quantities are unknown, so we will estimate them using our data. We can write that function in r:

```{r}
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

alpha.fn(Portfolio, 1:nrow(Portfolio))
```

but what is the sampling variability of α? we use boostrap for this:

```{r, fig.align='center', fig.width=7}
set.seed(1)
boot.out<- boot::boot(Portfolio, alpha.fn, R = 1000)
boot.out

plot(boot.out)

```

the statistic came as 0.5758 as before and we now know our standard error that came as 0.093

The plots show you the distribution of the samples and the QQ plot.
:::

## Use of Monte Carlo Simulation and Bootstrap

Simulations can also be used to check theoretical or analytical results. Also, many of the theoretical results we use in statistics are based on asymptotics: they hold when the sample size goes to infinity. In practice, we never have an infinite number of samples so we may want to know how well the theory works with our actual sample size. Sometimes we can answer this question analytically, but not always. Simulations are extremely useful in these cases.

As an example we are going to use simulation to compare the Central Limit Theorem to the t-distribution for different sample sizes. We will use our mice data and extract control and 'fake' treatment samples. The fake treatment samples are just samples labeled as treatment but extracted from the control population.

We will build a function that automatically generates a t-statistic under the null hypothesis for a sample size of `n`.

```{r}
dat <- read.csv("data/mice_pheno.csv")
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist

ttestgenerator <- function(n) {
  cases <- sample(controlPopulation,n)
  controls <- sample(controlPopulation,n)
  tstat <- (mean(cases)-mean(controls)) / 
      sqrt( var(cases)/n + var(controls)/n ) 
  return(tstat)
  }
ttests <- replicate(1000, ttestgenerator(10))
```

```{r ttest_hist, fig.cap="Histogram of 1000 Monte Carlo simulated t-statistics.", fig.align='center'}
hist(ttests)
```

we can use quantile-quantile plots to see how well this distribution approximates to the normal:

```{r ttest_qqplot, fig.cap="Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics to theoretical normal distribution.", fig.align='center'}
qqnorm(ttests)
abline(0,1)
```

This looks like a very good approximation. For this particular population, a sample size of 10 was large enough to use the CLT approximation. How about sample size 3?

```{r, ttest_df3_qqplot, fig.align='center',fig.cap="Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical normal distribution."}
ttests <- replicate(1000, ttestgenerator(3))
qqnorm(ttests)
abline(0,1)
```

Now we see that the large quantiles, referred to by statisticians as the *tails*, are larger than expected (below the line on the left side of the plot and above the line on the right side of the plot). In the previous module, we explained that when the sample size is not large enough and the *population values* follow a normal distribution, then the t-distribution is a better approximation. Our simulation results seem to confirm this:

In the code below first we generate the percentiles: `ps <- (seq(0,999)+0.5)/1000` seq(0,999) generates a sequence of numbers from 0 to 999. Adding 0.5 to each element centers the sequence values. Dividing by 1000 scales the values to be between 0 and 1. This results in 1000 equally spaced percentiles between 0.0005 and 0.9995. `qt(ps, df=2*3-2)` computes the quantiles of the t-distribution for the given percentiles `ps` with degrees of freedom `df=2*3-2` qqplot creates a QQ plot comparing the quantiles of our t-test results to the theoretical quantiles of the t-distribution.

```{r, ttest_v_tdist_qqplot,fig.cap="Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical t-distribution.", fig.align='center'}
ps <- (seq(0,999)+0.5)/1000
qqplot(qt(ps,df=2*3-2),ttests,xlim=c(-6,6),ylim=c(-6,6))
abline(0,1)
```

The QQ plot helps you visually assess how well your t-test results follow the theoretical t-distribution. The t-distribution is a much better approximation in this case, but it is still not perfect. This is due to the fact that the original data is not that well approximated by the normal distribution. We can see this plotting the qqplot for the full population data:

```{r, dat_qqplot, fig.cap="Quantile-quantile of original data compared to theoretical quantile distribution.", fig.align='center'}
qqnorm(controlPopulation)
qqline(controlPopulation)
```

# Shrinkage Methods

**Ridge regression and lasso.**

The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficients estimates, or equivalently, that shrinks the coefficient estimates towards zero.Shrinking the coefficient estimates can significantly reduce their variance.

**Ridge regression** Also called $L2$ Regularization. It addresses the problem of multicollinearity (when predictor variables are highly correlated). It is useful when you have many predictors and suspect multicollinearity. It works well when the number of predictors is larger than the number of observations.

$$
\sum^n_{i=1}\left( y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right) + \lambda\sum^p_{j=1}\beta^2_j
$$ {#eq-ridgeRegression}

the first part of the function is RSS so we can write it like this: $$
RSS + \lambda\sum^p_{j=1}\beta^2_j
$$ We call $\lambda\sum^p_{j=1}\beta^2_j$ the *Ridge Penaly* where $\lambda$ is called the *tuning parameter*, to be determined separately.It controls the strength of the penalty. When $\lambda$ is 0, it is just ordinary least square regression, as $\lambda$ increases, the flexibility of the model decreases, helping to reduce overfitting.

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small, however, the second term of the function, called shrinkage penalty, is small when the coefficients are close to zero, and so it has the effect of shrinking the estimates towards zero.

One thing to take into account with this methods is that while in our regression formulas the units or scales of our variables do not matter, (if we measure the weight in grams or in kilograms) because the coefficient will adapt to that scale, but in this method that actually matters, and as a result is is important to standardize the predictors before applying ridge regression. To standardize he predictors we do as usual, we divide the feature by the standard deviation of that feature. $$
\tilde x_{ij} = \frac{x_{ij}}{SD(x_{ij})} =\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n}(x_{ij-\bar{x_j}})^2}
$$ Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all $p$ predictors in the final model.

**Lasso** Also called $L1$ regularization. The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The formula is very similar to the ridge regression but instead of taking the squares of the coefficients, now it is the sum of the absolute value: $$
RSS + \lambda\sum^p_{j=1}|\beta_j|
$$ The difference is that it actually sets the coefficients to 0 when lambda is large enough, so it subsets the selection removing the features that are not important.

**Selecting the Tuning Parameter for Ridge Regression and Lasso**

As in ridge regression, selecting a good value of lambda for the lasso is critical, we will use again cross validation for this. Lasso is particularly useful when you expect only a few of predictors to be relevant, leading to a simpler, more interpretable model. If most of the predictors are relevant, ridge regression will do better selecting the best model. We don't know this when we create the model, so it is usual to use both methods, and use cross-validation to determine the best model coming out of each method by comparing the cross-validated error.

Cross validation provides a simple way to find the value of $\lambda$ both for lasso and ridge regression. We choose a grid of lambda values, and compute the cross-validation error rate for each value of it. We then select the tuning parameter value for which the cross-validation error is smallest. Finally the model is re-fit using all of the available observations and the selected value of the tuning parameters.

::: exercise-box
Example: Ridge regression

We will use the `glmnet` package in order to perform ridge regression and the lasso. The main function in this package is `glmnet()`, which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y \~ x syntax. We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Before proceeding ensure that the missing values have been removed from the data. The `model.matrix()` function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables. The latter property is important because `glmnet()` can only take numerical, quantitative inputs.

```{r}
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ . ,data= Hitters)[, -1]
y <- Hitters$Salary
```

The `glmnet()` function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.

By default the `glmnet()` function performs ridge regression for an automatically selected range of λ values. However, here we have chosen to implement the function over a grid of values ranging from $λ=10^{10}$ to $λ=10^{−2}$ , essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of λ that is not one of the original grid values. Note that by default, the `glmnet()` function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument `standardize = FALSE`.

```{r}
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
```

Associated with each value of λ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20×100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).

```{r}
dim(coef(ridge.mod))
ridge.mod$lambda[50] #the value of lambda at possition 50
coef(ridge.mod)[,50] #coefficients for that value of lambda
plot(ridge.mod, xvar="lambda", label=TRUE)
```

This plot is showing how the coefficients shrink as we increase the value of lambda.

We can use predict() function for a number of purposes For instance we can obtain the ridge regression coefficients for a new value of lambda, say 50:

```{r}
predict(ridge.mod, s=50, type= 'coefficients')[1:20,]
```

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and fit a ridge regression model on the training set and evaluate its MSE on the test set using lambda=4. Note the use of `predict()` this time we get predictions for a test set by replacing the `type="coefficients"` with the `newx` argument

```{r}
set.seed(1)
train<- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]

ridge.mod <- glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh= 1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx=x[test,])
(MSE<- mean((ridge.pred-y.test)^2))
```

the test MSE = `{r} MSE`

We now check whether there is any benefit to performing ridge regression with lambda=4 instead of just performing least squares regression (lambda=0). In order for `glmnet()` to yield exact least squares coefficients when lambda is 0, we use the argument `exact=T`

```{r}
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test, ],
    exact = T, x = x[train, ], y = y[train])
mean((ridge.pred - y.test)^2)

```

In general, instead of arbitrarily choosing a value for lambda, it would be better to use cross-validation to choose the tuning parameter. We can do this using the built in cross-validation function.

The `glmnet` package have a function `cv.glmnet()` that will do the cross validation. By default the function performs ten-fold cross-validation, though this can be changed using the argument `nfold`.

```{r}
set.seed(1)
cv.ridge<- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.ridge)

```

WE can see here that with higher values of lambda the mean squares errors are too high (the restricted coefficients are too small to fit the model properly). The vertical lines shows the minimum on the left and the other vertical line is at one standard error from the minimum. The top x axis shows the number of variables in the model. In this case what the graph is showing is that the model without adjustment with the full coefficients for all the variables is doing quite well to fit the model.

The best value for lambda that results in the smallest cross-validation error is 326. What is the MSE associated with this value of lambda?

```{r}
bestlam <- cv.ridge$lambda.min
bestlam
ridge.pred <- predict(ridge.mod, s = bestlam,
    newx = x[test, ])
mean((ridge.pred - y.test)^2)
```

This represents an improvement over lambda 0. Finally we refit our ridge regression model on the full data set, using the value of lambda chosen by cross-validation and examine the coefficient estimates.

```{r}
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20, ]
```
:::

::: exercise-box
Example: lasso

We saw that ridge regression with a wise choice of λ can outperform least squares on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the `glmnet()` function; however, this time we use the argument `alpha=1`. Other than that change, we proceed just as we did in fitting a ridge model.

```{r chunk39}
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1,
    lambda = grid)
plot(lasso.mod, xvar="lambda", label= TRUE)
```

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. The top x axes tells us how many variable coefficients are not 0 We now perform cross-validation and compute the associated test error.

```{r chunk40}
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min

#test error:
lasso.pred <- predict(lasso.mod, s = bestlam,
    newx = x[test, ])
mean((lasso.pred - y.test)^2)
```

This is substantially lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of ridge regression with $\lambda$ chosen by cross-validation.

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 8 of the 19 coefficient estimates are exactly zero. So the lasso model with $\lambda$ chosen by cross-validation contains only eleven variables.

We now run the model over the full dataset:

```{r }
out <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients",
    s = bestlam)[1:20, ]
lasso.coef[lasso.coef != 0]
```
:::

# Dimension Reduction Methods

To recapitulate, we have seen subset selection methods, where we were just taking a subset of the predictors and using least squares models choosing the best model. Then ridge regression and lasso, where we were taking all of the predictors, but it does not use least squares, but a shrinkage approach to fit the model. Now we are going to use least squares but not on the original predictors, instead we are going to come up with new predictors, which are linear combinations of the original ones.

Let $Z_1,Z_2,\dots,Z_M$ represent linear combination of some of the $p$ original predictors and use them to fit a linear regression model:

$$
y_i=\theta_0 + \sum_{m=1}^M \theta_mz_{im}+\epsilon_i
$$ {#eq-dimensionReductionFunction}

**Principal Components Regression (PCR)**

The first principal component is that (normalized) linear combination of the variables with the largest variance. The second principal component has largest variance, subject to being uncorrelated with the first etc.

**Partial Least Squares (PLS)** Like PCR, PLS is a dimension reduction method, which first identifies a new set of features that are linear combinations of the original features, and then fits a linear model. After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{ij}$ equal to the coefficient from the simple linear regression of Y onto $X_j$, hence, PLS places the highest weight on the variables that are most strongly related to the response. Subsequent directions (Z) are found by taking residuals and then repeating the above process.

# Multiple hypothesis testing

[In the Common Errors in significance testing](#testErrors) we saw that rejecting $H_0$ if the $p$-value is below a certain $\alpha$ threshold is a simple way to control Type I error. But now suppose that we wish to test $m$ null hypothesis. If we reject all null hypothesis for which the $p$-value falls bellow our $\alpha$, then how many Type I errors should we expect to make?

Suppose that we flip 1,024 fair coins ten times each. Then we would expect (on average) one coin to come up *all tails*. (There's a $\frac{1}{2}^{10} = \frac{1}{1,024}$ chance that any single coin will come up all tails. So if we flip 1,024 coins, then we expect one coin to come up all tails, on average). If one of our coins comes up all tails, then we might therefore conclude that this particular coin is not fair. In fact, a standard hypothesis test for the null hypothesis that this particular coin is fair would lead to a p-value below $0.002^{12}$ But it would be incorrect to conclude that the coin is not fair: in fact, the null hypothesis holds, and we just happen to have gotten ten tails in a row by chance.

```{r, fig.align='center'}
# Load necessary libraries
library(tidyverse)

# Set seed for reproducibility
set.seed(1)

# Function to simulate coin flips
flip_coins <- function(num_coins, num_flips) {
  results <- matrix(sample(c("H", "T"), num_coins * num_flips, replace = TRUE), ncol = num_flips)
  apply(results, 1, function(coin) all(coin == "T"))
}

# Parameters
num_coins <- 1024
num_flips <- 10

# Simulate the coin flips
all_tails <- flip_coins(num_coins, num_flips)

# Count how many coins came up all tails
num_all_tails <- sum(all_tails)

# Probability of getting all tails in 10 flips of a fair coin
prob_all_tails <- 0.5^10

# Output results
cat("Number of coins that came up all tails:", num_all_tails, "\n")

# Visualize the distribution of results
results_df <- data.frame(
  Coin = 1:num_coins,
  All_Tails = all_tails
)

ggplot(results_df, aes(x = Coin, y = as.numeric(All_Tails))) +
  geom_point() +
  labs(title = "Distribution of Coins Coming Up All Tails",
       x = "Coin Number",
       y = "All Tails (1 = Yes, 0 = No)") +
  theme_minimal()
```

When testing a huge number of null hypotheses, we are bound to get some very small p-values by chance. If we make a decision about whether to reject each null hypothesis without accounting for the fact that we have performed a very large number of tests, then we may end up rejecting a great number of true null hypotheses.

If we choose to reject the null hypothesis when the $p$-value is less than 0.01, then there is a 1% chance of making a false rejection. So if we have $m$ number of hypotheses we want to test, our probability of falsely rejecting the null hypothesis will be $m\times 0.01$ and that will be our number of Type I errors. How can we control fot this situation?

## Family-Wise Error Rate (FWER)

The family-wise rate is the probability of making *at least* one Type I error when conducting hypothesis tests.

When performing the test, we know how many hypotheses we have tested $m$, we also know how many $H_0$ we have rejected, but we don't know if we rejected them correctly or not, but we can approximate that. $$
FWER = 1 - (1 - \alpha)^m
$$

If the null hypothesis is true for each of $m$ independent hypothesis tests, then the FWER is equal to $1-(1-\alpha)^m$. We can use this expression to compute the FWER for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.

```{r }
m <- 1:500
fwe1 <- 1 - (1 - 0.05)^m
fwe2 <- 1 - (1 - 0.01)^m
fwe3 <- 1 - (1 - 0.001)^m
```

We plot these three vectors. The red, blue, and green lines correspond to $\alpha=0.05$, $0.01$, and $0.001$, respectively.

```{r , fig.align='center'}
par(mfrow = c(1, 1))
plot(m, fwe1, type = "l", log = "x", ylim = c(0, 1), col = 2,
    ylab = "Family - Wise Error Rate",
    xlab = "Number of Hypotheses")
lines(m, fwe2, col = 4)
lines(m, fwe3, col = 3)
abline(h = 0.05, lty = 2)
```

Even for moderate values of $m$ such as $50$, the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value, such as $0.001$. Of course, the problem with setting $\alpha$ to such a low value is that we are likely to make a number of Type II errors: in other words, our power is very low.

## Bonferroni Method for Controlling FWER

The *Bonferroni* method adjusts the significance level (𝛼) for each individual hypothesis test to ensure that the overall probability of making at least one Type I error (false positive) remains controlled. Given a number of hypotheses being tested $m$, we adjust the significance level for each test $\alpha_{adj}$ by dividing the desired significance level $\alpha$ byt the number of hypotheses $m$

$$
\alpha_{adj} = \frac{\alpha}{m}
$$ Now we compare the p-values of each test against the adjusted significance level and reject those null hypothesis for any test with a $p$-value less than or equal to $\alpha_{adj}$

We now consider the Fund dataset. It contains data for 50 months for a list of managers and their excess return, this is, the percentage points difference between the funds managed by these people compared with the stock market in general for the same period. In the table below we show the stats for the first 5 managers in our list.

If we control the Type I error at level α = 0.05 for each fund manager separately, then we will conclude that the first and third managers have significantly non-zero excess returns; in other words, we will reject H01 : μ1 = 0 and H03 : μ3 = 0. However, as discussed in previous sections, this procedure does not account for the fact that we have tested multiple hypotheses, and therefore it will lead to a FWER greater than 0.05. If we instead wish to control the FWER at level 0.05, then, using a Bonferroni correction, we must control the Type I error for each individual manager at level α/m = 0.05/5 = 0.01. Consequently, we will reject the null hypothesis only for the first manager, since the p-values for all other managers exceed 0.01.

We are going to do the calculation manually here, and later in the exercises we will use an `R` function that calculate this for us.

```{r}
fund <- ISLR2::Fund
managers <- numeric(0)
pvals <- numeric(0)
tstat <- numeric(0)
means <- numeric(0)
sds <- numeric(0)
for (i in 1:5){
  testResult <- t.test(fund[,i], mu = 0) 
managers <- c(managers, i)
pvals<- c(pvals, round(testResult$p.value,3))
tstat<- c(tstat, round(testResult$statistic,2))  
means<- c(means, round(mean(fund[,i]),1))
sds <- c(sds, round(sd(fund[,i]),1))
}
df<- data.frame(manager = managers, mean = means, StandDev = sds,  tStatistic = tstat,pValues = pvals)
df

```

The Bonferroni correction is by far the best-known and most commonly used multiplicity correction in all of statistics. Its ubiquity is due in large part to the fact that it is very easy to understand and simple to implement, and also from the fact that it successfully controls Type I error regardless of whether the m hypothesis tests are independent. However, as we will see, it is typically neither the most powerful nor the best approach for multiple testing correction. In particular, the Bonferroni correction can be quite conservative, in the sense that the true FWER is often quite a bit lower than the nominal (or target) FWER.

## Holm's Method for controlling FWER

For this method we start like with the Bonferroni, we just do our test and calculate our $p$-values as per normal, and then we sort those $p$-values from smallest to greatest, and one by one we calculate L that validates this formula

$$
L = \min \left\{ j : p_{j} > \frac{\alpha}{m+1-j} \right\}
$$ where $j$ is the index of the hypotheses tested, so $j= 1,2,\dots,m$

Once we have L, we reject all null hypothesis that are below L.

```{r}
df <- df %>% arrange(df$pValues)
alpha <- 0.05
m = 5
j=1

for (i in df$pValues){
  df$reject = df$pValues < (alpha/(m+1-j))
  j=j+1
  
}
df
```

We see that in this setting, we reject the null hypothesis for managers 1 and 3 using Holm's method.

Holm's method will reject at least as many hypothesis than the Bonferroni method, but usually rejects more.

::: exercise-box
Practice: Type I errors, FWER, Bonferrony, Holm's

We are going to simulate some random data to test 100 hypotheses to check if their mean is equal to 0. Each dataset contains 10 observations.

We create the data and then modify the first 50 columns so its mean is not 0 but 0.5, so we would expect to reject the first 50 $H_0$ hypotheses and do not reject the last 50 $H_0$ hypotheses.

```{r}
set.seed(6)
X<- matrix(rnorm(10*100), 10,100)
X[,1:50]<- X[,1:50] +0.5

p.values = rep(0,100)
for (i in 1:100)
  p.values[i] <- t.test(X[,i],mu=0)$p.value
decision <- rep("Do not reject H0", 100)
decision[p.values <= .05]<- 'reject H0'
table(decision,
  c(rep("H0 is False", 50), rep("H0 is true",50)))
```

The situation in which we rejected the null hypothesis when it is in fact true are our Type I errors, in our case 3.

With only 10 observations, we could only reject 10 out of the 50 false null hypotheses, specially because our mean was only 0.5 difference. We can see how if we change the mean of the first 50 hypotheses to 1 instead of 0.5, we have a better success on the number of Type II errors:

```{r}
X<- matrix(rnorm(10*100), 10,100)
X[,1:50]<- X[,1:50] +1

p.values = rep(0,100)
for (i in 1:100)
  p.values[i] <- t.test(X[,i],mu=0)$p.value
decision <- rep("Do not reject H0", 100)
decision[p.values <= .05]<- 'reject H0'
table(decision,
  c(rep("H0 is False", 50), rep("H0 is true",50)))

```

As we have done in the lecture above, we now conduct a one-sample $t$-test for each of the first five managers in the `Fund` dataset, in order to test the null hypothesis that the $j$th fund manager's mean return equals zero, $H_{0j}: \mu_j=0$.

```{r}
fund.mini <- Fund[, 1:5]
t.test(fund.mini[, 1], mu = 0)
fund.pvalue <- rep(0, 5)
for (i in 1:5)
  fund.pvalue[i] <- t.test(fund.mini[, i], mu = 0)$p.value
fund.pvalue
```

The $p$-values are low for Managers One and Three, and high for the other three managers. However, we cannot simply reject $H_{01}$ and $H_{03}$, since this would fail to account for the multiple testing that we have performed. Instead, we will conduct Bonferroni's method and Holm's method to control the FWER.

**Bonferroni**

To do this, we use the `p.adjust()` function. Given the $p$-values, the function outputs *adjusted* $p$-values, which can be thought of as a new set of $p$-values that have been corrected for multiple testing. If the adjusted $p$-value for a given hypothesis is less than or equal to $\alpha$, then that hypothesis can be rejected while maintaining a FWER of no more than $\alpha$. In other words, the adjusted $p$-values resulting from the `p.adjust()` function can simply be compared to the desired FWER in order to determine whether or not to reject each hypothesis.

For example, in the case of Bonferroni's method, the raw $p$-values are multiplied by the total number of hypotheses, $m$, in order to obtain the adjusted $p$-values. (However, adjusted $p$-values are not allowed to exceed $1$.)

```{r}
p.adjust(fund.pvalue, method = "bonferroni")
pmin(fund.pvalue * 5, 1)
```

Therefore, using Bonferroni's method, we are able to reject the null hypothesis only for Manager One while controlling the FWER at $0.05$.

**Holm's method**

By contrast, using Holm's method, the adjusted $p$-values indicate that we can reject the null hypotheses for Managers One and Three at a FWER of $0.05$.

```{r}
p.adjust(fund.pvalue, method = "holm")
```

As discussed previously, Manager One seems to perform particularly well, whereas Manager Two has poor performance.

```{r }
apply(fund.mini, 2, mean)
```

**Tukey's Honest Significant Difference (HSD) test**

Is there evidence of a meaningful difference in performance between these two managers? Performing a paired $t$-test using the `t.test()` function results in a $p$-value of $0.038$, suggesting a statistically significant difference.

```{r }
t.test(fund.mini[, 1], fund.mini[, 2], paired = T)
```

However, we decided to perform this test only after examining the data and noting that Managers One and Two had the highest and lowest mean performances.

In a sense, this means that we have implicitly performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than just one. Hence, we use the `TukeyHSD()` function to apply Tukey's method in order to adjust for multiple testing. This function takes as input the output of an *ANOVA* regression model, which is essentially just a linear regression in which all of the predictors are qualitative. In this case, the response consists of the monthly excess returns achieved by each manager, and the predictor indicates the manager to which each return corresponds.

```{r chunk13}
returns <- as.vector(as.matrix(fund.mini))
manager <- rep(c("1", "2", "3", "4", "5"), rep(50, 5))
a1 <- aov(returns ~ manager)
TukeyHSD(x = a1)
```

The `TukeyHSD()` function provides confidence intervals for the difference between each pair of managers (`lwr` and `upr`), as well as a $p$-value. All of these quantities have been adjusted for multiple testing. Notice that the $p$-value for the difference between Managers One and Two has increased from $0.038$ to $0.186$, so there is no longer clear evidence of a difference between the managers' performances. We can plot the confidence intervals for the pairwise comparisons using the `plot()` function.

```{r, fig.align='center'}
plot(TukeyHSD(x = a1))
```
:::

## Benjamini-Hochberg Procedure to control

**The False Discovery Rate** When $m$ is large, then trying to prevent any false positives (as in FWER control) is simply too stringent. Instead, we might try to make sure that the ratio of false positives to total positives (is sufficiently low, so that most of the rejected null hypotheses are not false positives. The ratio is known as the false discovery proportion (FDP).

It might be tempting to ask the data analyst to control the FDP: to make sure that no more than, say, 20% of the rejected null hypotheses are false positives. However, in practice, controlling the FDP is an impossible task for the data analyst, since she has no way to be certain, on any particular dataset, which hypotheses are true and which are false, Therefore, we instead control the false discovery rate (FDR) defined as FDR = E(FDP) = E(False Rejections/Total Rejections) When we control the FDR at (say) level q = 20%, we are rejecting as many null hypotheses as possible while guaranteeing that no more than 20% of those rejected null hypotheses are false positives, on average. The expectation is taken over the population from which the data are generated. For instance, suppose we control the FDR for $m$ null hypotheses at $q = 0.2$. This means that if we repeat this experiment a huge number of times, and each time control the FDR at $q = 0.2$, then we should expect that, on average, 20% of the rejected null hypotheses will be false positives. On a given dataset, the fraction of false positives among the rejected hypotheses may be greater than or less than 20%.

It is worth noting that unlike $p$-values, for which a threshold of 0.05 is typically viewed as the minimum standard of evidence for a "positive" result, and a threshold of 0.01 or even 0.001 is viewed as much more compelling, there is no standard accepted threshold for FDR control. Instead, the choice of FDR threshold is typically context-dependent, or even dataset dependent.

This method is useful when we have a large number of hypothesis to test and we are ok with rejecting a percentage of true null Hypotheses.

We need some way to connect the $p$-values, $p1,\dots,pm$, from the $m$ null hypotheses to the desired FDR value, $q$. It turns out that a very simple procedure can be used to control the FDR

-   First we specify $q$ that is the level at which we wish to control FDR

-   Compute the $p$-values for each of our hypotheses - Order the $p$-values from smallest to greatest - Define L where $$
    L= max \{ j:p_{(j)} < qj/m\}
    $$

-   Reject all null hypotheses $H_{0j}$ for which $p_j \leq p_{(L)}$

We are going to use now the full dataset of Fund with 2000 managers and apply the *Bonferroni* method and the *Benjamini-Hochberg* method. First, we are going to do the calculation manually and later we will use a function.

```{r, fig.align='center', fig.height=6}
fund <- ISLR2::Fund
managers <- numeric(0)
pvals <- numeric(0)
bonfRej <- numeric(0)
alpha = 0.05
m= length(fund)

#bonferroni rejections
for (i in 1:m){
  testResult <- t.test(fund[,i], mu = 0) 
managers <- c(managers, i)
pvals<- c(pvals, round(testResult$p.value,3))
bonf <- testResult$p.value <= (alpha/m)
bonfRej <- c(bonfRej, bonf)
}
df<- data.frame(manager = managers, pValues = pvals, bonfRej = bonfRej)


#Benjamini-Hochberg rejections

ps <- sort(df$pValues)
q <- 0.1
wh.ps <- which(ps < q * (1:m) / m)
if (length(wh.ps) >0) {
  wh <- 1:max(wh.ps)
 } else {
  wh <- numeric(0)
 }


plot(ps, log = "xy", ylim = c(4e-6, 1), ylab = "P-Value",
    xlab = "Index", main = "")
points(wh, ps[wh], col = 4)
abline(a = 0, b = (q / m), col = 2, untf = TRUE)
abline(h = 0.1 / 2000, col = 3)

sum(df$bonfRej)
max(wh)
```

The graph shows in blue the rejected hypotheses using BH

Due to the large number of $m$, the Bonferroni method does not reject any hypothesis, while Benjamin-Hochberg rejects 147, but we have to remember that we expect a percentage of them will be false rejections at our level of $q$.

Now we will calculate the Benjamini-Hochberg using `p.adjust()` function in R. We will show the first 10

```{r chunk16}
q.values.BH <- p.adjust(df$pValues, method = "BH")
q.values.BH[1:10]
```

The q-values output by the Benjamini-Hochberg procedure can be interpreted as the smallest FDR threshold at which we would reject a particular null hypothesis. For instance, a $q$-value of $0.1$ indicates that we can reject the corresponding null hypothesis at an FDR of $10\%$ or greater, but that we cannot reject the null hypothesis at an FDR below $10\%$.

If we control the FDR at $10\%$, then for how many of the fund managers can we reject $H_{0j}: \mu_j=0$?

```{r chunk17}
sum(q.values.BH <= .1)
```

We find that 147 of the 2,000 fund managers have a $q$-value below 0.1; therefore, we are able to conclude that 147 of the fund managers beat the market at an FDR of $10\%$. Only about 15 ($10\%$ of 147) of these fund managers are likely to be false discoveries.

## Resampling approaches

So far, we have assumed that we want to test some null hypothesis with some statistic $T$ and that we know, or can assume, the distribution of $T$ under $H_0$, this allows us to compute the $p$-value, but what if this theoretical null distribution is unknown?

We are only going to see this approach for a two-sample $t$-Test, and this method cannot be automatically used for other types of tests.

We have two samples $X$ and $Y$ and the $H_0$ is that the means of thos two samples is the same. As we have seen in the central limit theorem if $N$ is big enough, we know that $T$ approximately follows a normal distribution under $H_0$, but if $N$ is small, then we don't know the theoretical null distribution of $T$, so we can can use re sampling and let the computer find out the right distribution.

-   Compute the two-sample $t$-statistic $T$ on the original data $X$ and $Y$.
-   choose a large number (like 1000 or higher) $B$
-   Randomly shuffle the $X$ and $Y$ observations and select two samples $x^*$ and $y^*$
-   Compute a two sample $t$-statistic on the shuffled data $T^*$
-   repeat the two last steps $B$ times.
-   the $p$-value is given by the fraction of the times that the $t$-statistic on the shuffled data exceeded the absolute value, the $t$-statistic on the real data.

$$
 p=\frac{ \sum^B_{b=1}1_{(|T^{*b}| \geq |T|)}}{B}
$$ - now we can compare this calculated $p$-value with that one calculated at the beginning on the original two sample test.

The reason why we work with absolute values for the $T$ is because in this particular case it is a two-sided $t$-test so we are interested in whether $X$ is larger than $Y$ or vice-versa but if it's a one side $t$-test then we can get rid of the absolute values.

::: exercise-box
Practice

Here, we implement the re-sampling approach to hypothesis testing using the `Khan` dataset. We merge together train and test datasets because we are not using that here and we do the same for the response $y$ that gives us the type of cancer.

First, we merge the training and testing data, which results in observations (rows) on $83$ patients for $2{,}308$ columns. Each column represent a gene. So we are going to perform $2{,}308$ hypothesis test here.

```{r }
attach(Khan)
x <- rbind(xtrain, xtest)
y <- c(as.numeric(ytrain), as.numeric(ytest))
dim(x)
table(y)
```

There are four classes of cancer.To simplify this example we are only going to compare two of the 4 classes only. For each gene, we compare the mean expression in the second class (*rhabdomyosarcoma*) to the mean expression in the fourth class (*Burkitt's lymphoma*). We see in the table above that we have 29 observations for class 2 and 25 for class 4.

Performing a standard two-sample $t$-test on the $11$th gene produces a test-statistic of $-2.09$ and an associated $p$-value of $0.0412$, suggesting modest evidence of a difference in mean expression levels between the two cancer types.

```{r }
x <- as.matrix(x)
x1 <- x[which(y == 2), ]
x2 <- x[which(y == 4), ]
n1 <- nrow(x1)
n2 <- nrow(x2)
t.out <- t.test(x1[, 11], x2[, 11], var.equal = TRUE)
TT <- t.out$statistic
TT
t.out$p.value
```

However, this $p$-value relies on the assumption that under the null hypothesis of no difference between the two groups, the test statistic follows a $t$-distribution with $29+25-2=52$ degrees of freedom. Instead of using this theoretical null distribution, we can randomly split the 54 patients into two groups of 29 and 25, and compute a new test statistic. Under the null hypothesis of no difference between the groups, this new test statistic should have the same distribution as our original one. Repeating this process 10,000 times allows us to approximate the null distribution of the test statistic. We compute the fraction of the time that our observed test statistic exceeds the test statistics obtained via re-sampling.

```{r}
set.seed(1)
B <- 10000
Tbs <- rep(NA, B) #initializing an empty vector for the T statistic
for (b in 1:B) {
   dat <- sample(c(x1[, 11], x2[, 11])) #reshufle the 11th gene for the two types of cancer
   Tbs[b] <- t.test(dat[1:n1], 
                    dat[(n1 + 1):(n1 + n2)],
        var.equal = TRUE
      )$statistic
}
mean((abs(Tbs) >= abs(TT))) #calculate how many T's are higher than the original
```

This fraction, $0.0416$, is our re-sampling-based $p$-value. It is almost identical to the $p$-value of $0.0412$ obtained using the theoretical null distribution. This means that the assumptions we made in the original test hold for this data

We can plot a histogram of the re-sampling-based test statistics.

```{r fig.align='center'}
hist(Tbs, breaks = 100, xlim = c(-4.2, 4.2), main = "",
    xlab = "Null Distribution of Test Statistic", col = 7)

lines(seq(-4.2, 4.2, len = 1000),
      dt(seq(-4.2, 4.2, len = 1000),
      df = (n1 + n2 - 2)
      ) * 1000, col = 2, lwd = 3)
abline(v = TT, col = 4, lwd = 2)
text(TT + 0.5, 350, paste("T = ", round(TT, 4), sep = ""),
    col = 4)
```

In yellow we have the resampling null distribution and in red the theoretical $t$ distribution for mean 0, sd 1 and 52 degrees of freedom.

The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red, so it is to no surprise that the theoretical $p$-value from the test is almost the same as the resampling $p$-value. This just confirms that it is reasonable to use the $t$-distribution here.

Finally, we implement the plug-in re-sampling FDR approach outlined. Depending on the speed of your computer, calculating the FDR for all 2,308 genes in the `Khan` dataset may take a while. Hence, we will illustrate the approach on a random subset of 100 genes. For each gene, we first compute the observed test statistic, and then produce $10{,}000$ re-sampled test statistics. This may take a few minutes to run. If you are in a rush, then you could set `B` equal to a smaller value (e.g. `B = 500`).

```{r}
m <- 100
B <- 1000
set.seed(1)
index <- sample(ncol(x1), m)
Ts <- rep(NA, m)
Ts.star <- matrix(NA, ncol = m, nrow = B)
for (j in 1:m) {
  k <- index[j]
  Ts[j] <- t.test(x1[, k], x2[, k],
        var.equal = TRUE
      )$statistic
  for (b in 1:B) {
    dat <- sample(c(x1[, k], x2[, k]))
    Ts.star[b, j] <- t.test(dat[1:n1],
                           dat[(n1 + 1):(n1 + n2)], 
                           var.equal = TRUE
       )$statistic
  }
}
```

Next, we compute the number of rejected null hypotheses $R$, the estimated number of false positives $\widehat{V}$, and the estimated FDR, for a range of threshold values $c$. The threshold values are chosen using the absolute values of the test statistics from the $100$ genes.

```{r}
cs <- sort(abs(Ts))
FDRs <- Rs <- Vs <- rep(NA, m)
for (j in 1:m) {
  R <- sum(abs(Ts) >= cs[j])
  V <- sum(abs(Ts.star) >= cs[j]) / B
  Rs[j] <- R
  Vs[j] <- V
  FDRs[j] <- V / R
}
```

Now, for any given FDR, we can find the genes that will be rejected. For example, with the FDR controlled at 0.1, we reject 15 of the 100 null hypotheses. On average, we would expect about one or two of these genes (i.e. $10\%$ of 15) to be false discoveries.

At an FDR of $0.2$, we can reject the null hypothesis for $28$ genes, of which we expect around six to be false discoveries. The variable `index` is needed here since we restricted our analysis to just $100$ randomly-selected genes.

```{r chunk27}
max(Rs[FDRs <= .1])
sort(index[abs(Ts) >= min(cs[FDRs < .1])])
max(Rs[FDRs <= .2])
sort(index[abs(Ts) >= min(cs[FDRs < .2])])
```

```{r fig.align='center'}
plot(Rs, FDRs, xlab = "Number of Rejections", type = "l",
    ylab = "False Discovery Rate", col = 4, lwd = 3)
```

As noted in the chapter, much more efficient implementations of the re-sampling approach to FDR calculation are available, using e.g. the `samr` package in `R`.
:::